autoresearch-mamba

@dtthinky · Generally, tinkering with optimal generalization, AI learning efficiency and post-transformer architectures.


There have been significant developments regarding Mamba (SSM) architectures in recent days. In summary:

  • Carnegie Mellon, Princeton, Cartesia AI and Together AI have released Mamba‑3, a new SSM that prioritizes inference efficiency and expressive state evolution rather than pure training‑speed wins like Mamba‑2. Mamba‑3 introduces a more general recurrence via an exponential‑trapezoidal discretization scheme, complex‑valued SSM states, and MIMO (multi‑input, multi‑output) SSMs to boost accuracy without slowing decoding, and it already outperforms Mamba‑2 and strong linear‑attention baselines on language‑modeling tasks.

  • Nemotron‑3 Super uses a hybrid Mamba‑Transformer MoE stack: Mamba‑style State‑Space‑Model (SSM) layers are interleaved with Transformer blocks to combine linear‑time sequence processing with strong long‑range attention and in‑context reasoning.

  • Across several labs and industry partners, the near‑term roadmap emphasizes larger Mamba‑scale models (10B–100B parameters), better Mamba‑Transformer hybrids tuned per domain, and edge‑ready deployments—all leveraging the fact that Mamba‑based SSMs can process long sequences at roughly linear time and lower memory cost than vanilla Transformers.

The core idea of autoresearch caught on very quickly as one of the first concrete steps toward self-improving research loops. Karpathy even put this entry in his repo:

"One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of "group meeting". That era is long gone. Research is now entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures in the skies. The agents claim that we are now in the 10,205th generation of the code base, in any case no one could tell if that's right or wrong as the "code" is now a self-modifying binary that has grown beyond human comprehension. This repo is the story of how it all began. -@karpathy, March 2026."

Karpathy's repo is not a giant training stack or an overengineered platform. It is a tight experimental loop that an agent can actually drive.

What I wanted was the same basic setup, but for the Mamba family of state space model (SSM) architectures.

Why Mamba

Most of the early autoresearch experiments naturally centered on GPT / nanoGPT architectures. Mamba is interesting for a different set of reasons.

State space models have become especially compelling in long-sequence settings, inference-sensitive workloads and places where sequence efficiency matters more than simply reusing the same transformer recipe forever.

If the point of autoresearch is to test whether an agent can improve a model recipe under a fixed budget, then it makes sense to ask the same question on top of Mamba as well.

Can an agent iteratively improve a Mamba training recipe under the same hard constraints Karpathy used for GPT?

That is what this repo is for.

What The Repo Actually Is

autoresearch-mamba is deliberately small. The structure mirrors the same discipline that makes Karpathy's setup useful:

  • prepare_mlx.py: fixed data prep, tokenizer, dataloader, and val_bpb evaluator for the MLX path
  • train_mamba_mlx.py: the single main file the agent is supposed to mutate during MLX autoresearch
  • program.md: instructions for the agent loop
  • prepare.py and train_mamba.py: a secondary PyTorch/CUDA path
  • analysis.ipynb: a notebook for analyzing results.tsv and plotting progress over time

The canonical path is Apple Silicon + MLX. The CUDA/PyTorch path exists as a secondary reference implementation.

Architecturally, the current target is Mamba-2-style, not Mamba-3. The point here was not to recreate every upstream kernel path or every fused implementation detail in mamba_ssm.

The point was to build a compact, agent-friendly research harness that preserves the core Mamba block logic while staying small enough to iterate on.
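To make "core Mamba block logic" concrete, here is a minimal sketch of the kind of recurrence a Mamba-2-style layer computes, written as a plain sequential scan for a single input channel. This is not the repo's code. The function name, argument names, and the single-channel simplification are my own assumptions for illustration, and real implementations use chunked or fused kernels rather than a Python loop.

```python
import math

def ssm_scan(x, dt, a, B, C):
    """Sequential reference scan of a selective SSM for one channel.

    x:  list of T input scalars
    dt: list of T positive step sizes (input-dependent in Mamba)
    a:  scalar decay parameter, negative so the state shrinks over time
    B:  list of T input-projection vectors, each of length N
    C:  list of T output-projection vectors, each of length N
    All names are illustrative, not taken from the repo.
    """
    n = len(B[0])
    h = [0.0] * n  # fixed-size hidden state, independent of sequence length
    ys = []
    for x_t, dt_t, B_t, C_t in zip(x, dt, B, C):
        decay = math.exp(dt_t * a)  # scalar in (0, 1) when a < 0
        # State update: h_t = decay * h_{t-1} + dt_t * B_t * x_t
        h = [decay * h_i + dt_t * b_i * x_t for h_i, b_i in zip(h, B_t)]
        # Readout: y_t = <C_t, h_t>
        ys.append(sum(c_i * h_i for c_i, h_i in zip(C_t, h)))
    return ys
```

Feeding an impulse `x = [1.0, 0.0, 0.0]` with `dt = [1.0, 1.0, 1.0]`, `a = -math.log(2)`, and `B = C = [[1.0]] * 3` gives outputs of approximately 1.0, 0.5, and 0.25: the state halves at every step. The loop touches each position once and carries a constant-size state, which is where the linear-time, low-memory decoding behavior comes from.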

The Autoresearch Loop

The loop is intentionally simple:

  1. Prepare data and tokenizer once.
  2. Run training for a fixed TIME_BUDGET = 300 seconds.
  3. Evaluate using val_bpb.
  4. Keep the change if val_bpb improved.
  5. Otherwise discard it and move on.
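The steps above can be sketched as a small greedy loop. This is a simplified illustration, not the repo's driver: `propose` and `run_experiment` are hypothetical callables standing in for "the agent edits the training file" and "train for the time budget, then return val_bpb".

```python
def autoresearch_loop(propose, run_experiment, baseline_recipe, n_rounds):
    """Greedy keep/discard search over training recipes.

    propose(recipe) -> candidate recipe   (hypothetical: the agent's edit)
    run_experiment(recipe) -> val_bpb     (hypothetical: fixed-budget train + eval)
    """
    best_recipe = baseline_recipe
    best_bpb = run_experiment(best_recipe)
    history = [(best_recipe, best_bpb, "baseline")]

    for _ in range(n_rounds):
        candidate = propose(best_recipe)
        bpb = run_experiment(candidate)
        if bpb < best_bpb:
            # Strict improvement: the candidate becomes the new starting point.
            best_recipe, best_bpb = candidate, bpb
            history.append((candidate, bpb, "keep"))
        else:
            # No improvement: drop the change and keep the previous best.
            history.append((candidate, bpb, "discard"))

    return best_recipe, best_bpb, history
```

Every candidate is proposed from the current best recipe, so a discarded change never leaks into later rounds.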

That five-minute budget matters. It forces the question to become: what is the best model and training recipe an agent can discover on this hardware in this exact amount of time?

It also keeps the search honest. The agent cannot win by quietly changing the evaluator, expanding the benchmark, or redefining success after the fact. It has one target: lower val_bpb under the fixed harness.

Why val_bpb

The metric here is validation bits per byte (val_bpb). Lower is better.

I like this choice for the same reason Karpathy did: it stays tied to language modeling performance without becoming too tokenization-specific. It is not a GPT metric. It is not a transformer metric. It works just as well for a Mamba autoregressive language model.
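For readers who have not used the metric before, here is a minimal sketch of how bits per byte can be computed from per-token losses. It assumes the evaluator has the cross-entropy of each target token in nats and the UTF-8 byte length of each target token. The function and argument names are mine, and the repo's evaluator may treat special tokens differently.

```python
import math

def bits_per_byte(token_nll_nats, token_byte_lengths):
    """Convert summed token-level cross-entropy into bits per byte.

    token_nll_nats:     per-token negative log-likelihoods, in nats
    token_byte_lengths: number of UTF-8 bytes each target token decodes to
    """
    total_nats = sum(token_nll_nats)
    total_bytes = sum(token_byte_lengths)
    if total_bytes == 0:
        raise ValueError("no bytes to normalize by")
    # Divide by ln(2) to convert nats to bits, then normalize by bytes
    # rather than by token count.
    return total_nats / (math.log(2) * total_bytes)
```

For example, two tokens that each cost one bit (`math.log(2)` nats) and together span four bytes give 0.5 bits per byte. Because the denominator is bytes, a tokenizer that splits the same text into more or fewer tokens does not change the scale of the metric.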

One subtle but important point: in a fixed five-minute autoresearch loop, the final val_bpb is not just about the model. It is also about how much optimization progress your hardware can fit into the time budget. So comparisons are meaningful when the setup is held constant: same evaluator, same preset, same platform class.
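A small sketch shows why hardware leaks into the metric. Under a wall-clock budget, the number of optimizer steps is an outcome rather than a setting. The helper below is hypothetical (it is not the repo's training loop), and `clock` is injectable only so the behavior can be checked without waiting five minutes.

```python
import time

def train_for_budget(step_fn, time_budget_s, clock=time.monotonic):
    """Run step_fn repeatedly until the wall-clock budget is spent.

    Returns the number of steps that fit into the budget. A step that
    starts before the deadline is allowed to finish.
    """
    start = clock()
    steps = 0
    while clock() - start < time_budget_s:
        step_fn()
        steps += 1
    return steps
```

With a 300-second budget, a machine that takes 1 second per step completes 300 steps, while one that takes 2 seconds per step completes 150. The same recipe therefore reaches a different point on its loss curve depending on the hardware, which is why results are only comparable within one platform class.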

MLX, GPU, And The Local Preset

One of the practical constraints here was that I wanted this to work not only as a theoretical repo, but as something I could actually test on Apple Silicon.

So the repo has two operating modes:

  • the tracked full MLX baseline, intended as the main setup
  • an optional local MLX preset for smaller-memory local testing

That local preset turned out to be important. It let me validate the loop end-to-end on constrained Apple Silicon while keeping the full baseline intact as the canonical tracked configuration.

In other words: the repo is not just "MLX-compatible" in theory. It is set up so the autoresearch loop can actually run locally, collect results.tsv, and generate a progress plot without changing the underlying philosophy.
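As a rough idea of what the progress analysis involves, the sketch below computes the running-best val_bpb from a tab-separated results log. The column layout used here (`commit`, `val_bpb`, `status`) is an assumption for illustration. Check the header of your own results.tsv, since the repo's actual schema may differ.

```python
import csv
import io

def running_best(tsv_text, metric="val_bpb"):
    """Return the best-so-far value of `metric` after each logged experiment.

    tsv_text: contents of a tab-separated log with a header row.
    The column name is an assumed default, not a confirmed schema.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    best = float("inf")
    curve = []
    for row in reader:
        best = min(best, float(row[metric]))  # lower val_bpb is better
        curve.append(best)
    return curve

# Hypothetical log contents, for illustration only.
example = "commit\tval_bpb\tstatus\na1\t1.20\tkeep\nb2\t1.25\tdiscard\nc3\t1.10\tkeep\n"
print(running_best(example))
```

Plotting this curve against the experiment index gives the usual staircase-shaped progress plot: flat while changes are discarded, stepping down on each kept change.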

What I Wanted To Preserve From Karpathy's Repo

There are a lot of ways to turn "autonomous research" into fluff. The thing I wanted to preserve from Karpathy's setup was not the vibe. It was the discipline:

  • one fixed evaluator
  • one main mutable training file
  • one fixed time budget
  • one ground-truth metric
  • one keep/discard loop

That design is what makes the repo useful. Without those constraints, you do not really have autoresearch. You just have an agent editing files until the story sounds good.

What Ships Today

The release includes:

  • the MLX Mamba autoresearch path
  • the secondary PyTorch/CUDA reference path
  • a documented README and program.md
  • a local MLX preset for smaller Apple Silicon testing
  • an analysis notebook for plotting progress from results.tsv

The repo is public here:

What Comes Next

The current repo is focused on the Mamba-2-style target.

The next obvious directions are:

  • a Mamba-3 implementation
  • hybrid Mamba-Transformer MoE variants
  • longer real autoresearch runs on stronger hardware

But the important part is already in place: the loop itself. A compact, testable autoresearch harness for Mamba, with MLX as a first-class path instead of an afterthought.

That was the goal.