

Dogukan Tuna - Personal Website
Hello, I usually go by the "dt.thinky!" alias, though my name is Doğukan.
I'm working on generalization, super-efficient learning algorithms, and pretraining autoresearch. Ambitiously, I'm conducting research on a new, entirely CUDA-based training and auto-research engine called Axion, which takes a non-traditional approach to the scaling recipe toward superhuman generalization. Trying to maximize my net positive impact on shortening the timeline to achieving superhuman capabilities.
You may enjoy reading some of the articles I've written.
My notes, experiments and things I find worth sharing:
On this website:
Research!
View allA running log of posts, experiments, and longer notes from the work.
Verified Replay Distillation (VRD) recipe for continual learning in verifiable domains
autoresearch-mamba: Karpathy-Style Autoresearch for Mamba-2, Mamba-3, and Hybrid Mamba-Transformer MoE
Mem-RLM — Memory-Augmented Inference for Recursive Language Models
On Compression, Computation and the Space Between
Neural Networks & New Kinds!
Compression is how I think about learning. The tighter a model can compress its inputs, the more structure it has actually found. Kolmogorov complexity makes this precise — it measures the length of the shortest program that produces a given output, which turns out to be the theoretical floor for any compressor.
The ultimate compressor
K(X) = length of the shortest program that outputs X
For any computable compressor C and all strings X:
K(X) ≤ |C(X)| + K(C) + O(1)
via the simulation argument — run C inside a universal machine
The catch
K(X) is uncomputable — you can never know the true shortest program.
But a deep network is a finite parallel computer that approximates it with bounded resources.
MAGICAL!
Why neural nets are compressors
Neural nets can simulate arbitrary programs
↓
They are small computers — circuits wired by data
↓
SGD searches over the space of programs they can express
Micro-Kolmogorov complexity
Fix an architecture, then fit a network with SGD — the bit-length of the resulting weights is a practical proxy for description length:
micro-K(f) ≈ bit-length of weights in a fixed architecture
minf ∈ F [ loss(f) + λ · micro-K(f) ]
Shorter description length → better generalization.
I'm deeply invested in methods that make learning systems compress harder and generalize further.


