Agents are not particularly proficient with the specialized frameworks, SDKs, and simulators that run on accelerated hardware — the GPU-optimized physics engines, QPU-targeted quantum compilers, protein language model training stacks, and atomistic simulation kernels that power serious scientific and engineering compute. This is the kind of knowledge you only find in official docs that change every release, scattered across examples that assume you already know the framework.
I wanted my agents to be competent in these domains. Not "can probably guess the API" competent — actually knowing the current function signatures, the right patterns, the common pitfalls, which model to pick for which problem. Whether it's configuring a PhysicsNeMo Curator ETL pipeline, choosing between Ewald and PME for electrostatics, or setting up multi-QPU distribution with CUDA Quantum.
The obvious approach is fine-tuning or RAG over raw docs. Both have problems. Fine-tuning is expensive, static, and doesn't update when a new release ships. RAG over raw docs gives you chunks of HTML with no structure, no prioritization, and no opinion about what matters. You get recall without understanding.
So I built something simpler: agent-skills-for-compute.
The Structure
Each skill is a directory with a SKILL.md (under 500 lines) and a references/ folder with deep-dive files on specific topics. The format follows the agentskills.io specification from Callstack's open-source project.
skills/
  cuda-quantum/
    SKILL.md                        # Quick reference, model tables, problem mapping
    references/
      kernels-and-gates.md          # Kernel syntax, qubit allocation, gate operations
      noise-modeling.md             # Noise channels, Kraus operators, density matrix sim
      hardware-backends.md          # QPU targets, IonQ/IQM/OQC/Quantinuum configs
      variational-algorithms.md     # VQE, QAOA, parameter optimization
      ...8 files total
  physicsnemo/
    SKILL.md
    references/
      models-and-architectures.md   # 18+ architectures: GNNs, transformers, neural operators
      advanced-features.md          # torch.compile + external kernels, DGL→PyG migration
      data-pipelines.md             # PhysicsNeMo Curator ETL, HDF5/VTK/Zarr
      ...7 files total
  bionemo/
    SKILL.md
    references/
      esm2-protein-language.md      # Pre-training, fine-tuning, LoRA, checkpoints
      evo2-genomic-model.md         # 40kb context DNA model, variant effects
      geneformer-single-cell.md     # Cell type classification, GRN inference
      ...9 files total
  alchemi-toolkit-ops/
    SKILL.md
    references/
      neighbor-lists.md             # Auto-dispatch, cell list, rebuild detection
      electrostatics.md             # Ewald, PME, Coulomb, multipole, autograd
      dispersion-corrections.md     # DFT-D3(BJ), coordination numbers, 4-pass kernels
      math-and-utilities.md         # GTO, spherical harmonics, batch processing
Each SKILL.md has the same bones: an overview, a quick pattern (wrong way vs right way), a quick command for installation, a reference table of all APIs, a priority-ordered guide to the reference files, and a problem-to-skill mapping. The idea is that an agent reads SKILL.md first, then drills into whichever reference file matches the task.
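A skeleton of that shape might look like the following. The section names here are illustrative, not the repo's exact headings:

```
# <Framework> Skill

## Overview
One paragraph: what the framework is and when to reach for it.

## Quick Pattern
The wrong way vs. the right way for the most common task.

## Installation
The pip/conda one-liner.

## API Reference Table
| API | Purpose | Notes |

## Reference Files (priority order)
1. references/<topic-a>.md: read first for ...
2. references/<topic-b>.md: ...

## Problem-to-Skill Mapping
"I need to do X" -> references/<topic>.md
```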
What's Inside
The skills aren't summaries. They're operational knowledge — the kind of thing a senior engineer who's spent days with the framework would tell a new hire.
Concrete examples:
PhysicsNeMo: 18 model architectures with when-to-use guidance. X-MeshGraphNet for 100M+ cell meshes. Transolver with Transformer Engine fp8 on Hopper GPUs. DoMINO NIM inference via physicsnemo.cfd.inference.domino_nim.call_domino_nim. The @torch.library.custom_op pattern for integrating cuML and NVIDIA Warp kernels with torch.compile without graph breaks.
BioNeMo: ESM-2 pre-training recipes on 64 H100s with validation perplexity tables (8M: 10.26, 650M: 7.14, 3B: 6.42). Evo2's 40kb context window for genomic variant effect prediction. Geneformer's single-cell foundation model with CELLxGENE Census integration (23.87M cells). The bionemo-scdl format and convert_h5ad_to_scdl conversion pipeline.
ALCHEMI Toolkit-Ops: GPU-accelerated neighbor list construction with auto-dispatch (naive O(N²) for under 5000 atoms, cell_list O(N) for larger). Ewald summation vs PME decision boundary. DFT-D3 dispersion with BJ damping parameters per functional (PBE: a1=0.3981, a2=4.4211, s8=0.7875). Pre-allocation patterns for torch.compile compatibility.
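The auto-dispatch idea is simple enough to sketch. This is an illustrative NumPy toy, not the ALCHEMI API: the function names and the elided cell-list branch are mine; only the 5000-atom threshold comes from the text above.

```python
import numpy as np

# Dispatch between a naive O(N^2) pairwise search (fine for small systems)
# and a cell-list O(N) build (needed for large ones), based on atom count.
NAIVE_THRESHOLD_ATOMS = 5000  # crossover point mentioned in the text

def neighbor_pairs_naive(pos: np.ndarray, cutoff: float) -> np.ndarray:
    """All-pairs distance check; returns each (i, j) pair once with i < j."""
    diff = pos[:, None, :] - pos[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    i, j = np.where((dist < cutoff) & (dist > 0.0))
    keep = i < j
    return np.stack([i[keep], j[keep]], axis=1)

def build_neighbors(pos: np.ndarray, cutoff: float) -> np.ndarray:
    if len(pos) < NAIVE_THRESHOLD_ATOMS:
        return neighbor_pairs_naive(pos, cutoff)
    # A real implementation would bin atoms into cells of side >= cutoff
    # and only compare neighboring cells.
    raise NotImplementedError("cell-list path elided in this sketch")

pos = np.array([[0.0, 0.0, 0.0], [0.9, 0.0, 0.0], [5.0, 0.0, 0.0]])
print(build_neighbors(pos, cutoff=1.0))  # only atoms 0 and 1 are within cutoff
```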
CUDA Quantum: Kernel syntax with cudaq.qubit allocation. Noise channel composition with Kraus operators. QPU backend configs for IonQ, Quantinuum, IQM, OQC. Multi-GPU distribution with cudaq.sample_async and mqpu targets.
How It Was Built
Each skill went through a deep audit cycle: extract every page from the official documentation, read what already exists in the skill files, run a gap analysis, then update or create files to close the gaps.
Why This Format
Skills are just markdown files. No infrastructure. No vector database. No embedding pipeline. An agent loads the file, reads it, and has the knowledge. If PhysicsNeMo ships a new release with breaking changes, you update the markdown. The "retraining" cost is editing a text file.
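"No infrastructure" is nearly literal. A hypothetical loader for this layout, assuming only the SKILL.md + references/ convention described above, is a few lines:

```python
from pathlib import Path

def load_skill(skill_dir: str, topic: str = "") -> str:
    """Read SKILL.md first; optionally append one matching reference file.

    Illustrative sketch: any real agent runtime does the equivalent of this.
    """
    root = Path(skill_dir)
    text = (root / "SKILL.md").read_text()
    if topic:
        ref = root / "references" / f"{topic}.md"
        if ref.exists():
            text += "\n\n" + ref.read_text()
    return text
```

That is the entire retrieval stack: read a file, hand it to the model.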
The constraint that matters is the 500-line limit on SKILL.md. It forces you to be opinionated. You can't dump every API — you have to decide what's important, what's the right default, what the agent should reach for first. The references/ directory is where depth lives, but the skill file is where judgment lives.
The format is fully compatible with Claude Code Skills, OpenAI Codex Skills, and the agentskills.io specification. Same directory structure, same markdown files. Drop a skill into any of these systems and it works. No adapter layer, no conversion — the skill catalog is agent-runtime agnostic by design.
This is the same pattern from my earlier work on inference-time skill acquisition. The difference is scope. That post was about a single skill built in a single session. This is a catalog that I plan to grow and maintain continuously.
Where This Is Going
The repo just launched with four NVIDIA skills. The goal is to cover every major accelerated compute workload — mature and experimental. Physics engines, quantum compilers, molecular dynamics simulators, climate models, genomics pipelines, HPC job schedulers, distributed training frameworks. If it runs on GPUs, TPUs, or QPUs and has a nontrivial API surface, it's a candidate.
Each skill gets the same treatment: deep audit against official docs, structured into the SKILL.md + references format, kept current as releases ship. The catalog grows incrementally — new skills added, existing ones updated when frameworks evolve.
The format isn't limited to NVIDIA either. Any major framework, SDK, engine, or simulator with accelerated compute backends fits the same structure. The skill just needs to encode the operational knowledge that makes an agent competent: what the APIs are, when to use what, and what breaks if you get it wrong. The long-term aim is a comprehensive reference that lets any agent — mine or yours — walk into a specialized HPC domain and be productive immediately.
The repo is open source: github.com/dtunai/agent-skills-for-compute.
Four skills and 28 reference files today, covering quantum circuits to molecular dynamics. More coming.
