There have been significant developments regarding Mamba (SSM) architectures in recent days. In summary:
- Carnegie Mellon, Princeton, Cartesia AI and Together AI have released Mamba‑3, a new SSM that prioritizes inference efficiency and expressive state evolution rather than the pure training‑speed wins of Mamba‑2. Mamba‑3 introduces a more general recurrence via an exponential‑trapezoidal discretization scheme, complex‑valued SSM states, and MIMO (multi‑input, multi‑output) SSMs to boost accuracy without slowing decoding, and it already outperforms Mamba‑2 and strong linear‑attention baselines on language‑modeling tasks.
- Nemotron‑3 Super uses a hybrid Mamba‑Transformer MoE stack: Mamba‑style State‑Space‑Model (SSM) layers are interleaved with Transformer blocks to combine linear‑time sequence processing with strong long‑range attention and in‑context reasoning.
- Across several labs and industry partners, the near‑term roadmap emphasizes larger Mamba‑scale models (10B–100B parameters), better Mamba‑Transformer hybrids tuned per domain, and edge‑ready deployments—all leveraging the fact that Mamba‑based SSMs can process long sequences at roughly linear time and lower memory cost than vanilla Transformers.
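The hybrid interleaving idea above can be sketched in a few lines. This is a purely illustrative layer plan: the interleaving ratio (one attention block every four layers) and the function name are assumptions for illustration, not the actual Nemotron recipe.

```python
# Hypothetical sketch of a hybrid Mamba-Transformer layer layout.
# The ratio (attention every 4th layer) is an assumption, chosen only
# to show the pattern: mostly linear-time SSM blocks, with periodic
# attention blocks for long-range mixing and in-context reasoning.

def build_hybrid_stack(n_layers: int, attn_every: int = 4) -> list:
    """Return a layer plan interleaving SSM blocks with attention blocks."""
    return [
        "attention" if (i + 1) % attn_every == 0 else "mamba_ssm"
        for i in range(n_layers)
    ]

plan = build_hybrid_stack(8)
# -> ['mamba_ssm', 'mamba_ssm', 'mamba_ssm', 'attention',
#     'mamba_ssm', 'mamba_ssm', 'mamba_ssm', 'attention']
```

In a real hybrid, each string would map to an actual block module; the point is only that the stack stays mostly linear-time while keeping a few full-attention layers.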
The core concept of autoresearch caught on very quickly, becoming one of the pioneering ideas for a first step toward self-improvement. Karpathy even included an entry like this in his repo:
"One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of "group meeting". That era is long gone. Research is now entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures in the skies. The agents claim that we are now in the 10,205th generation of the code base, in any case no one could tell if that's right or wrong as the "code" is now a self-modifying binary that has grown beyond human comprehension. This repo is the story of how it all began. -@karpathy, March 2026."
The result is not a giant training stack or an overengineered platform. It is a tight experimental loop that an agent can actually drive.
What I wanted was the same basic setup, but for the family of Mamba (State Space Models) architectures.
Why Mamba
Most of the early autoresearch experiments naturally centered on GPT / nanoGPT architectures. Mamba is interesting for a different set of reasons.
State space models have become especially compelling in long-sequence settings, inference-sensitive workloads and places where sequence efficiency matters more than simply reusing the same transformer recipe forever.
If the point of autoresearch is to test whether an agent can improve a model recipe under a fixed budget, then it makes sense to ask the same question on top of Mamba as well.
Can an agent iteratively improve a Mamba training recipe under the same hard constraints Karpathy used for GPT?
That is what this repo is for.
What The Repo Actually Is
autoresearch-mamba is deliberately small. The structure mirrors the same discipline that makes Karpathy's setup useful:
- prepare_mlx.py: fixed data prep, tokenizer, dataloader, and val_bpb evaluator for the MLX path
- train_mamba_mlx.py: the single main file the agent is supposed to mutate during MLX autoresearch
- program.md: instructions for the agent loop
- prepare.py and train_mamba.py: a secondary PyTorch/CUDA path
- analysis.ipynb: a notebook for analyzing results.tsv and plotting progress over time
The canonical path is Apple Silicon + MLX. The CUDA/PyTorch path exists as a secondary reference implementation.
Architecturally, the current target is Mamba-2-style, not Mamba-3. The point here was not to recreate every upstream kernel path or every fused implementation detail in mamba_ssm.
The point was to build a compact, agent-friendly research harness that preserves the core Mamba block logic while staying small enough to iterate on.
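The "core Mamba block logic" being preserved is, at its heart, a selective scan. A minimal sketch, reduced to a diagonal single-channel case for clarity (this is a simplification for illustration, not the repo's actual block), looks like:

```python
import math

# Minimal, illustrative selective-scan sketch (diagonal, single channel).
# The Mamba-style recurrence is h_t = a_t * h_{t-1} + dt_t * b_t * x_t,
# y_t = c_t * h_t, where a_t = exp(dt_t * A) comes from discretizing a
# continuous state-space model. Real Mamba blocks are multi-channel and
# use input-dependent dt, B, C; this scalar version only shows the shape
# of the computation.

def selective_scan(x, dt, A, B, C):
    """Run the scalar SSM recurrence over an input sequence x."""
    h, ys = 0.0, []
    for x_t, dt_t, b_t, c_t in zip(x, dt, B, C):
        a_t = math.exp(dt_t * A)        # discretized state decay
        h = a_t * h + dt_t * b_t * x_t  # state update
        ys.append(c_t * h)              # readout
    return ys
```

Because each step only touches a fixed-size state, the scan runs in linear time over the sequence, which is exactly the property that makes Mamba attractive for long contexts.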
The Autoresearch Loop
The loop is intentionally simple:
- Prepare data and tokenizer once.
- Run training for a fixed TIME_BUDGET = 300 seconds.
- Evaluate using val_bpb.
- Keep the change if val_bpb improved.
- Otherwise discard it and move on.
That five-minute budget matters. It forces the question to become: what is the best model and training recipe an agent can discover on this hardware in this exact amount of time?
It also keeps the search honest. The agent cannot win by quietly changing the evaluator, expanding the benchmark, or redefining success after the fact. It has one target: lower val_bpb under the fixed harness.
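The keep/discard logic is simple enough to sketch directly. The hook names below (propose_mutation, train_and_eval) are placeholders for the agent's actions, not the repo's actual API:

```python
TIME_BUDGET = 300  # seconds of training per candidate, fixed

def autoresearch_loop(recipe, n_iters, train_and_eval, propose_mutation):
    """Keep/discard loop: accept a mutation only if val_bpb improves.

    train_and_eval(recipe, budget) -> val_bpb after `budget` seconds of
    training; propose_mutation(recipe) -> a candidate recipe. Both are
    hypothetical hooks standing in for the agent's edits to the training file.
    """
    best_bpb = train_and_eval(recipe, TIME_BUDGET)
    for _ in range(n_iters):
        candidate = propose_mutation(recipe)
        bpb = train_and_eval(candidate, TIME_BUDGET)
        if bpb < best_bpb:  # lower val_bpb is better: keep the change
            recipe, best_bpb = candidate, bpb
        # otherwise: discard the candidate and move on
    return recipe, best_bpb
```

Note that the evaluator and the budget live outside the mutable surface: the agent can only change the recipe, never the scoring.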
Why val_bpb
The metric here is validation bits per byte (val_bpb). Lower is better.
I like this choice for the same reason Karpathy did: it stays tied to language modeling performance without becoming too tokenization-specific. It is not a GPT metric. It is not a transformer metric. It works just as well for a Mamba autoregressive language model.
One subtle but important point: in a fixed five-minute autoresearch loop, the final val_bpb is not just about the model. It is also about how much optimization progress your hardware can fit into the time budget. So comparisons are meaningful when the setup is held constant: same evaluator, same preset, same platform class.
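For concreteness, bits per byte is the validation cross-entropy converted from nats to bits and normalized by the byte count of the validation text. This is the standard definition; the repo's evaluator may differ in bookkeeping details:

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (in nats) over a validation split
    into bits per byte. Dividing by ln(2) converts nats to bits;
    dividing by the byte count normalizes away tokenizer choices."""
    return total_nll_nats / (math.log(2) * total_bytes)
```

Because the denominator is bytes rather than tokens, a model cannot look better simply by switching to a tokenizer with fewer, larger tokens, which is why the metric transfers cleanly between GPT-style and Mamba-style models.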
MLX, GPU, And The Local Preset
One of the practical constraints here was that I wanted this to work not only as a theoretical repo, but as something I could actually test on Apple Silicon.
So the repo has two operating modes:
- the tracked full MLX baseline, intended as the main setup
- an optional local MLX preset for smaller-memory local testing
That local preset turned out to be important. It let me validate the loop end-to-end on constrained Apple Silicon while keeping the full baseline intact as the canonical tracked configuration.
In other words: the repo is not just "MLX-compatible" in theory. It is set up so the autoresearch loop can actually run locally, collect results.tsv, and generate a progress plot without changing the underlying philosophy.
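The results.tsv bookkeeping is itself trivial, which is part of the point. A sketch of the append-and-query pattern (the column layout here — step, val_bpb, note — is an assumption for illustration; the repo's actual schema may differ):

```python
import csv

# Hypothetical results.tsv columns: step, val_bpb, note.
# This is an illustrative sketch, not the repo's actual logging code.

def append_result(path, step, val_bpb, note=""):
    """Append one run's outcome as a tab-separated row."""
    with open(path, "a", newline="") as f:
        csv.writer(f, delimiter="\t").writerow([step, f"{val_bpb:.4f}", note])

def best_result(path):
    """Return the row with the lowest val_bpb (column index 1)."""
    with open(path) as f:
        rows = list(csv.reader(f, delimiter="\t"))
    return min(rows, key=lambda r: float(r[1]))
```

A flat TSV keeps the history inspectable with nothing but a text editor, and the analysis notebook only needs to read it back to plot progress.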
What I Wanted To Preserve From Karpathy's Repo
There are a lot of ways to turn "autonomous research" into fluff. The thing I wanted to preserve from Karpathy's setup was not the vibe. It was the discipline:
- one fixed evaluator
- one main mutable training file
- one fixed time budget
- one ground-truth metric
- one keep/discard loop
That design is what makes the repo useful. Without those constraints, you do not really have autoresearch. You just have an agent editing files until the story sounds good.
What Ships Today
The release includes:
- the MLX Mamba autoresearch path
- the secondary PyTorch/CUDA reference path
- a documented README and program.md
- a local MLX preset for smaller Apple Silicon testing
- an analysis notebook for plotting progress from results.tsv
The repo is public here:
What Comes Next
The current repo is focused on the Mamba-2-style target.
The next obvious directions are:
- a Mamba-3 implementation
- hybrid Mamba-Transformer MoE variants
- longer real autoresearch runs on stronger hardware
But the important part is already in place: the loop itself. A compact, testable autoresearch harness for Mamba, with MLX as a first-class path instead of an afterthought.
That was the goal.
