A self-improving skill catalog for AI agents

6 min read
If you find this useful, you can follow me on X

@dtthinky · Generally, tinkering with optimal generalization, AI learning efficiency and post-transformer architectures.

The catalog started as a simple observation: agents don't know specialized compute frameworks. They hallucinate APIs, miss version-specific changes, and guess at patterns that require deep domain knowledge. So I had my agents build themselves a skill catalog — structured markdown files that encode operational knowledge for every major accelerated compute framework I work with.

What I didn't expect: the catalog became self-improving. The same agents that consume the skills also generate new ones. When I needed a skill for Prime Intellect's verifiers, the agent cloned the repo, read the AGENTS.md, analyzed the architecture, and wrote the skill. Same for Tinker. Same for autoresearch-setup. The agents don't just use the catalog — they grow it.

How It Works

Each skill is a directory with a SKILL.md (under 500 lines) and a references/ folder for deep-dive content. The format follows the agentskills.io specification — compatible with Claude Code, OpenAI Codex, and any agent runtime that reads markdown.

skills/
  cuda-quantum/
    SKILL.md                      # Quick reference, model tables, problem mapping
    references/
      kernels-and-gates.md        # Kernel syntax, qubit allocation, gate operations
      noise-modeling.md           # Noise channels, Kraus operators, density matrix sim
      hardware-backends.md        # QPU targets, IonQ/IQM/OQC/Quantinuum configs
      ...8 files total

  prime-verifiers/
    SKILL.md                      # RL environments, rubrics, reward functions
    references/

  tinker/
    SKILL.md                      # Low-level LLM training API
    references/

  autoresearch-setup/
    SKILL.md                      # Autonomous experiment loop scaffolding

An agent reads SKILL.md first, then drills into whichever reference file matches the task. The skills aren't summaries — they're the operational knowledge a senior engineer would pass to a new hire: the current function signatures, the right patterns, the common pitfalls, and which primitives to reach for.

The Self-Improving Loop

Here's how a new skill gets created:

  1. I tell the agent "create a skill for X"
  2. The agent clones the repo, reads the docs (and any existing AGENTS.md or CLAUDE.md in the project)
  3. It analyzes the architecture, identifies key concepts, maps out the API surface
  4. It generates a SKILL.md following the same structure as every other skill
  5. It installs the skill in the agent's own skill directories, updates indexes, commits, and pushes
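
Step 4 of this loop can be sketched as a scaffold generator. The section names and the `scaffold_skill` helper are assumptions for illustration; the real agent fills in researched content rather than TODO markers:

```python
from pathlib import Path

# Hypothetical shared section layout, so every skill follows one structure.
SECTIONS = ["Overview", "Core Concepts", "API Surface", "Common Pitfalls"]

def scaffold_skill(name: str, summary: str, out_dir: str) -> Path:
    """Emit a SKILL.md skeleton plus an empty references/ directory."""
    root = Path(out_dir) / name
    (root / "references").mkdir(parents=True, exist_ok=True)
    body = [f"# {name}", "", summary, ""]
    for section in SECTIONS:
        body += [f"## {section}", "", "TODO", ""]
    path = root / "SKILL.md"
    path.write_text("\n".join(body))
    return path
```

Because every skill shares one skeleton, the consuming agent always knows where to look, which is what makes the catalog readable without any retrieval machinery.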

The newest three skills — autoresearch-setup, prime-verifiers, and tinker — were all created this way in a single session. The agent researched each project, understood the philosophy, wrote the skills, and deployed them. Then it updated this blog post to document what it built.

This is the loop: agents consume skills to be competent, then create new skills to make future agents more competent. The catalog compounds.

What's Inside

19 skills across 9 domains:

Autonomous Research:

  • Autoresearch-Setup: Scaffolding for autonomous experiment loops.

LLM Post-Training & Fine-Tuning:

  • Prime Intellect Verifiers & PrimeRL: RL environments, rubrics, and reward functions for LLM post-training. Environment class hierarchy from SingleTurnEnv to SandboxEnv. GEPA genetic-Pareto prompt optimization. Async distributed RL training at 1000+ GPUs with GRPO/PPO/RLOO.
  • Tinker (Thinking Machines Lab): Low-level training API — four primitives (forward_backward, optim_step, save, sample) give full algorithmic control over remote GPU clusters. LoRA up to 235B-param models. 15+ recipes: math RL, code RL, multiplayer RL, prompt distillation, DPO, RLHF.

Quantum Computing:

  • CUDA Quantum: Kernel syntax, noise channels, QPU backend configs (IonQ, Quantinuum, IQM, OQC), multi-GPU distribution.
  • Qiskit: Runtime primitives, transpiler optimization, VQE/QAOA, Qiskit Metal hardware design, error mitigation — 15 reference files, the deepest skill.

Physics & Scientific Computing:

  • PhysicsNeMo: 18 model architectures, torch.compile integration with external kernels, Curator ETL pipelines.
  • ALCHEMI Toolkit-Ops: GPU-accelerated neighbor lists, Ewald/PME electrostatics, DFT-D3 dispersion.
  • Materials Project: MPRester API, crystal structures, phase diagrams, pymatgen integration.
  • RDKit: Molecular I/O, fingerprints, substructure search, 3D conformers, scaffold analysis.

Computational Biology:

  • BioNeMo: ESM-2, Evo2 (40kb genomic context), Geneformer single-cell models, NIM microservices.

GPU Programming & Kernels:

  • Triton: Block-based GPU kernels, Flash Attention v2, persistent kernels, FP8/FP4, auto-tuning, CUDA/HIP/CPU backends.
  • FlashInfer: Paged KV-cache attention, cascade inference (31x speedup), MoE fusion, vLLM/SGLang integration.

LLM & Generative AI:

  • Nemotron: Hybrid Mamba-Transformer MoE (up to 253B), 1M context, NIM deployment, synthetic data generation.
  • Omniverse SimReady: OpenUSD physics simulation, Isaac Sim robotics, semantic labeling.

HPC & Distributed Computing:

  • SLURM: Job scheduling, GPU/GRES allocation, job arrays, monitoring.
  • Dask: Parallel DataFrames/Arrays, GPU acceleration with RAPIDS, HPC deployment.
  • Ray: Distributed tasks/actors, Ray Train/Serve/Tune/RLlib, KubeRay.
  • MPI: Point-to-point, collectives, MPI-IO, GPU-aware MPI.

Data Processing:

  • Polars: Lazy query optimization, GPU acceleration via cuDF, expressions API.

Why Markdown

Skills are just markdown files. No vector database. No embedding pipeline. No inference-time retrieval system. An agent loads the file, reads it, and has the knowledge. The "retraining" cost when a framework ships a breaking change is editing a text file.

The 500-line limit on SKILL.md is the constraint that matters. You can't dump every API — you have to decide what's important, what the right default is, what the agent should reach for first. The references/ directory holds depth. The skill file holds judgment.
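
That 500-line budget is easy to enforce mechanically. A minimal check, assuming a hypothetical `check_skill` helper that a commit hook or CI step could call:

```python
from pathlib import Path

MAX_LINES = 500  # the SKILL.md budget described above

def check_skill(skill_md: str) -> tuple[bool, int]:
    """Return (ok, line_count) so oversized skill files can be rejected."""
    n = len(Path(skill_md).read_text().splitlines())
    return n <= MAX_LINES, n
```

Anything that fails the check belongs in references/, which keeps the constraint doing its job: forcing judgment into the skill file and depth into the reference files.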

Where This Is Going

The catalog started with 4 NVIDIA-specific skills. It's at 19 now, with 108 reference files, covering domains from quantum circuits to LLM post-training to autonomous research loops.

The scope keeps expanding because the creation cost is low. An agent that already has the catalog can create a new skill in minutes — clone a repo, read the docs, write the skill, install it, push. The catalog is the flywheel: more skills → more capable agents → faster skill creation → more skills.

What's next: every major framework, SDK, engine, or simulator with accelerated compute backends. LLM training stacks, inference engines, autonomous research workflows, quantum compilers, molecular dynamics simulators, climate models, genomics pipelines. If it has a nontrivial API surface and runs on specialized hardware, it's a candidate.

The long-term aim is a comprehensive, self-maintaining reference that lets any agent — mine or yours — walk into any specialized compute domain and have the operational knowledge to be useful immediately.

19 skills. 108 reference files. Built by agents, for agents. Self-improving.