

dtthinky!
Hello! I usually go by the "dt.thinky!" alias.
Currently, I'm working on megakernels, optimizers, and kernel optimizations.
You may enjoy reading some of the articles I've written.
My notes, experiments, and things I find worth sharing on this website:
Research!
A running log of posts, experiments, and longer notes from my work.
Verified Replay Distillation (VRD) recipe for continual learning in verifiable domains
autoresearch-mamba: Karpathy-Style Autoresearch for Mamba-2, Mamba-3, and Hybrid Mamba-Transformer MoE
Mem-RLM — Memory-Augmented Inference for Recursive Language Models
On Compression, Computation and the Space Between
Defeating Nondeterminism in LLM Inference: Reproducing Batch-Invariant Ops (RMSNorm & Tiled Matrix Multiplication) in JAX
Streaming deepagents and task delegation with real-time output
Energetics of Allosteric Communication in Ubiquitin Revealed by Hybrid MCTS-Langevin Simulations
Neural Networks & New Kinds!
Compression is how I think about learning. The tighter a model can compress its inputs, the more structure it has actually found. Kolmogorov complexity makes this precise — it measures the length of the shortest program that produces a given output, which turns out to be the theoretical floor for any compressor.
The ultimate compressor
K(X) = length of the shortest program that outputs X
For any computable compressor C and all strings X:
K(X) ≤ |C(X)| + K(C) + O(1)
via the simulation argument: a universal machine can run C's decompressor on C(X) and print X
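To make that inequality tangible, here's a minimal sketch of my own (zlib stands in for C, and the name k_upper_bound is mine): the compressed length of a string is an upper bound on its Kolmogorov complexity, up to the fixed cost of the decompressor.

```python
# |C(X)| as an upper bound on K(X): zlib plays the role of C here.
# The decompressor's description is the K(C) + O(1) constant we ignore.
import os
import zlib

def k_upper_bound(x: bytes) -> int:
    """Length of C(x): an upper bound on K(x) up to an additive constant."""
    return len(zlib.compress(x, level=9))

structured = b"ab" * 500        # 1000 bytes with obvious structure
random_ish = os.urandom(1000)   # 1000 bytes with (almost surely) none

print(k_upper_bound(structured))  # small: zlib found the repeating pattern
print(k_upper_bound(random_ish))  # ~1000 or more: nothing to exploit
```

No computable compressor reaches K(X) itself, but every one of them certifies a ceiling on it.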
The catch
K(X) is uncomputable — you can never know the true shortest program.
But a deep network is a finite parallel computer that approximates it with bounded resources.
MAGICAL!
Why neural nets are compressors
Neural nets can simulate arbitrary programs
↓
They are small computers — circuits wired by data
↓
SGD searches over the space of programs they can express
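A toy illustration of that last arrow, as a sketch of my own (nothing here is canonical): the weights are the program, XOR is the target behavior, and plain SGD walks weight space until the behavior matches.

```python
# SGD as program search: the weights are the program, XOR the target behavior.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# a tiny "computer": a 2 -> 8 -> 1 network, randomly wired to start
W1, b1 = rng.normal(0, 1, (2, 8)), np.zeros(8)
W2, b2 = rng.normal(0, 1, (8, 1)), np.zeros(1)

for step in range(3000):
    h = np.tanh(X @ W1 + b1)                 # hidden activations
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))     # sigmoid output
    d2 = (p - y) / len(X)                    # cross-entropy gradient at the logits
    d1 = (d2 @ W2.T) * (1 - h ** 2)          # backprop through tanh
    for param, grad in ((W2, h.T @ d2), (b2, d2.sum(0)),
                        (W1, X.T @ d1), (b1, d1.sum(0))):
        param -= 0.5 * grad                  # one SGD step through program space

print(p.round(2).ravel())  # typically converges to ≈ [0, 1, 1, 0]
```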
Micro-Kolmogorov complexity
Fix an architecture, then fit a network with SGD — the bit-length of the resulting weights is a practical proxy for description length:
micro-K(f) ≈ bit-length of weights in a fixed architecture
min_{f ∈ F} [ loss(f) + λ · micro-K(f) ]
Shorter description length → better generalization.
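A minimal sketch of that objective, with all names mine (micro_k, bits_per_weight, and the value of λ are illustrative, not a fixed recipe): description length is approximated as parameter count times the bits kept per weight after quantization.

```python
# micro-K(f) as a bit count: parameters times bits retained per weight.
import numpy as np

def micro_k(params, bits_per_weight=16):
    """Proxy for description length: bits needed to store all weights."""
    return sum(p.size for p in params) * bits_per_weight

def objective(train_loss, params, lam=1e-6):
    """loss(f) + λ · micro-K(f), the quantity we'd like the search to minimize."""
    return train_loss + lam * micro_k(params)

# the 2 -> 8 -> 1 network above: 16 + 8 + 8 + 1 = 33 weights
params = [np.zeros((2, 8)), np.zeros(8), np.zeros((8, 1)), np.zeros(1)]
print(micro_k(params))          # 33 * 16 = 528 bits
print(objective(0.05, params))  # training loss plus the complexity penalty
```

One way to read this: quantizing to 8 bits halves micro-K at the same parameter count, which is the description-length view of why compressing weights and regularizing loss pull in the same direction.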
I'm deeply invested in methods that make learning systems compress harder and generalize further.





