Teaching agents GPU/TPU/QPU compute: An open skill catalog

Agents are not particularly proficient with the specialized frameworks, SDKs, and simulators that run on accelerated hardware — the GPU-optimized physics engines, QPU-targeted quantum compilers, protein language model training stacks, and atomistic simulation kernels that power serious scientific and engineering compute. That knowledge lives in official docs that change every release and in scattered examples that assume you already know the framework.

I wanted my agents to be competent in these domains. Not "can probably guess the API" competent — actually knowing the current function signatures, the right patterns, the common pitfalls, which primitives to pick for which problem. Whether it's configuring a PhysicsNeMo Curator ETL pipeline, choosing between Ewald and PME for electrostatics, or setting up multi-QPU distribution with CUDA Quantum.

So I built something simpler: agent-skills-for-compute.

The Structure

Each skill is a directory with a SKILL.md (under 500 lines) and a references/ folder with deep-dive files on specific topics. The format follows the agentskills.io specification from Callstack's open-source project.

skills/
  cuda-quantum/
    SKILL.md                      # Quick reference, model tables, problem mapping
    references/
      kernels-and-gates.md        # Kernel syntax, qubit allocation, gate operations
      noise-modeling.md           # Noise channels, Kraus operators, density matrix sim
      hardware-backends.md        # QPU targets, IonQ/IQM/OQC/Quantinuum configs
      variational-algorithms.md   # VQE, QAOA, parameter optimization
      ...8 files total

  physicsnemo/
    SKILL.md
    references/
      models-and-architectures.md # 18+ architectures: GNNs, transformers, neural operators
      advanced-features.md        # torch.compile + external kernels, DGL→PyG migration
      data-pipelines.md           # PhysicsNeMo Curator ETL, HDF5/VTK/Zarr
      ...7 files total

  bionemo/
    SKILL.md
    references/
      esm2-protein-language.md    # Pre-training, fine-tuning, LoRA, checkpoints
      evo2-genomic-model.md       # 40kb context DNA model, variant effects
      geneformer-single-cell.md   # Cell type classification, GRN inference
      ...9 files total

  alchemi-toolkit-ops/
    SKILL.md
    references/
      neighbor-lists.md           # Auto-dispatch, cell list, rebuild detection
      electrostatics.md           # Ewald, PME, Coulomb, multipole, autograd
      dispersion-corrections.md   # DFT-D3(BJ), coordination numbers, 4-pass kernels
      math-and-utilities.md       # GTO, spherical harmonics, batch processing

Each SKILL.md has the same bones: an overview, a quick pattern (wrong way vs right way), a quick command for installation, a reference table of all APIs, a priority-ordered guide to the reference files, and a problem-to-skill mapping. The idea is that an agent reads SKILL.md first, then drills into whichever reference file matches the task.

What's Inside

The skills aren't summaries. They're operational knowledge — the kind of thing a senior engineer who's spent days with the framework would tell a new hire.

16 skills spanning quantum computing, physics and scientific computing, computational biology, GPU programming and kernels, LLM and generative AI, HPC and distributed computing, and data processing:

Quantum Computing:

  • CUDA Quantum: Kernel syntax with cudaq.qubit allocation. Noise channel composition with Kraus operators. QPU backend configs for IonQ, Quantinuum, IQM, OQC. Multi-GPU distribution with cudaq.sample_async and mqpu targets. A minimal kernel sketch follows this list.
  • Qiskit: Runtime primitives (Sampler/Estimator), transpiler optimization passes, VQE/QAOA implementations, Qiskit Metal hardware design, error mitigation techniques. A primitives sketch also follows below.
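
As a flavor of what the CUDA Quantum skill documents, here is a minimal GHZ-kernel sketch in CUDA-Q's Python DSL, closely following the official examples (the qubit count and shot count are arbitrary):

```python
import cudaq

@cudaq.kernel
def ghz(qubit_count: int):
    # Allocate a qubit register and prepare a GHZ state.
    qubits = cudaq.qvector(qubit_count)
    h(qubits[0])
    for i in range(1, qubit_count):
        x.ctrl(qubits[i - 1], qubits[i])
    mz(qubits)

# Local simulation; hardware or mqpu targets are selected with
# cudaq.set_target(...) without changing the kernel itself.
counts = cudaq.sample(ghz, 3, shots_count=1000)
print(counts)
```

And a corresponding Qiskit sketch using the local StatevectorSampler, which mirrors the Runtime SamplerV2 interface (shot count is arbitrary):

```python
from qiskit import QuantumCircuit
from qiskit.primitives import StatevectorSampler

# Bell pair measured in the computational basis.
qc = QuantumCircuit(2)
qc.h(0)
qc.cx(0, 1)
qc.measure_all()

# V2 primitive interface: run a batch of circuits, read counts per pub.
result = StatevectorSampler().run([qc], shots=1000).result()
print(result[0].data.meas.get_counts())
```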

Physics & Scientific Computing:

  • PhysicsNeMo: 18 model architectures with when-to-use guidance. X-MeshGraphNet for 100M+ cell meshes. Transolver with Transformer Engine fp8 on Hopper GPUs. DoMINO NIM inference via physicsnemo.cfd.inference.domino_nim.call_domino_nim. The @torch.library.custom_op pattern for integrating cuML and NVIDIA Warp kernels with torch.compile without graph breaks. A sketch of this pattern follows this list.
  • ALCHEMI Toolkit-Ops: GPU-accelerated neighbor list construction with auto-dispatch (naive O(N²) for under 5000 atoms, cell_list O(N) for larger). Ewald summation vs PME decision boundary. DFT-D3 dispersion with BJ damping parameters per functional (PBE: a1=0.3981, a2=4.4211, s8=0.7875). Pre-allocation patterns for torch.compile compatibility.
  • Materials Project: MPRester API for crystal structures, DFT properties, phase diagrams. pymatgen integration for structure analysis, symmetry operations, comparisons. Electronic structure (band structures, DOS), phonon properties, batch operations.
  • RDKit: Molecular I/O (SMILES, SDF), descriptors and fingerprints, substructure search with SMARTS, 3D conformer generation, stereochemistry handling, Murcko scaffold analysis. A short usage sketch also follows below.
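
To make the torch.compile integration pattern concrete, here is a minimal sketch of torch.library.custom_op (PyTorch 2.4+). The op name and trivial body are placeholders; in a real pipeline the body would call into a cuML or NVIDIA Warp kernel:

```python
import torch

# Register an opaque custom op; torch.compile treats it as a single node
# instead of tracing into it and breaking the graph.
@torch.library.custom_op("demo::scale", mutates_args=())
def scale(x: torch.Tensor, factor: float) -> torch.Tensor:
    # Placeholder body; imagine an external GPU kernel here.
    return x * factor

# Fake (meta) implementation so the compiler can infer shapes and dtypes.
@scale.register_fake
def _(x: torch.Tensor, factor: float) -> torch.Tensor:
    return torch.empty_like(x)

@torch.compile(fullgraph=True)  # fullgraph=True asserts there are no graph breaks
def forward(x: torch.Tensor) -> torch.Tensor:
    return scale(x, 2.0) + 1.0

print(forward(torch.randn(4)))
```

And a minimal RDKit sketch of the descriptor / substructure / conformer workflow (the aspirin SMILES and the SMARTS pattern are arbitrary examples):

```python
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors

# Parse aspirin from SMILES and compute a couple of descriptors.
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
print(Descriptors.MolWt(mol), Descriptors.TPSA(mol))

# Substructure search with a SMARTS pattern (carboxyl/ester group).
print(mol.HasSubstructMatch(Chem.MolFromSmarts("C(=O)O")))

# Morgan fingerprint and a 3D conformer with explicit hydrogens.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
mol3d = Chem.AddHs(mol)
AllChem.EmbedMolecule(mol3d, randomSeed=42)
```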

Computational Biology:

  • BioNeMo: ESM-2 pre-training recipes on 64 H100s with validation perplexity tables (8M: 10.26, 650M: 7.14, 3B: 6.42). Evo2's 40kb context window for genomic variant effect prediction. Geneformer's single-cell foundation model with CELLxGENE Census integration (23.87M cells). The bionemo-scdl format and convert_h5ad_to_scdl conversion pipeline.

GPU Programming & Kernels:

  • Triton: Block-based GPU programming with @triton.jit. Flash Attention v2 (111-166 TFLOPS), persistent kernels with TMA, FP8/FP4 quantization. Auto-tuning configs, CUDA/HIP/CPU backends, TRITON_INTERPRET mode for CPU debugging. C-style integer division semantics (critical difference from Python). A minimal kernel sketch follows this list.
  • FlashInfer (MLSys 2025 Best Paper): Paged KV-cache attention kernels for LLM serving. 29-69% inter-token latency reduction. Cascade inference (31x speedup for shared prefix). FP8/FP4 GEMM, MoE fusion, sorting-free sampling O(log V). vLLM/SGLang integration.
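
As a flavor of the block-based programming model, here is a minimal vector-add kernel in the style of the official Triton tutorial (block size and array size are arbitrary; it assumes a CUDA device):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
print(torch.allclose(add(x, y), x + y))
```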

LLM & Generative AI:

  • Nemotron: Hybrid Mamba-Transformer MoE architecture, 1M context window. NIM microservices for deployment. Synthetic data generation with NemotronGenerator. NeMo Framework pretraining recipes, multi-token prediction, NVFP4 training.
  • Omniverse SimReady: OpenUSD asset specification for physically accurate 3D simulation. USDPhysics schema, semantic labeling with WikiData QCodes, Isaac Sim integration for robotics, synthetic data generation with Replicator.

HPC & Distributed Computing:

  • SLURM: sbatch/srun job submission, GPU/GRES allocation with --gres=gpu:a100:4, job arrays with concurrency control, squeue/sacct monitoring, QOS policies.
  • Dask: Distributed DataFrames/Arrays, LocalCUDACluster for RAPIDS cuDF/CuPy, Dask-Jobqueue for HPC integration, lazy evaluation with query optimization.
  • Ray: Distributed tasks/actors, Ray Data for batch processing, Ray Train for distributed training, Ray Serve for model serving, Ray Tune for hyperparameter optimization.
  • MPI: Point-to-point (Send/Recv), collective operations (Broadcast, Scatter, Reduce), MPI-IO for parallel I/O, process mapping, SLURM integration with srun. A minimal mpi4py sketch follows this list.
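
For the MPI skill, a minimal mpi4py sketch of a collective plus a point-to-point exchange (launch with mpirun -n 2, or srun under SLURM; the buffer size and tag are arbitrary):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Collective: broadcast a buffer from rank 0 to every rank.
data = np.arange(4, dtype=np.float64) if rank == 0 else np.empty(4, dtype=np.float64)
comm.Bcast(data, root=0)

# Point-to-point: rank 0 sends the buffer to rank 1.
if rank == 0:
    comm.Send(data, dest=1, tag=7)
elif rank == 1:
    buf = np.empty(4, dtype=np.float64)
    comm.Recv(buf, source=0, tag=7)
    print("rank 1 received", buf)
```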

Data Processing:

  • Polars: Lazy query optimization with predicate/projection pushdown, GPU acceleration via RAPIDS cuDF engine, expressions API with window functions, Parquet/CSV I/O with cloud storage support. A minimal sketch follows below.
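
A minimal sketch of the lazy-query pattern the Polars skill covers. The file name and columns are made up, and the GPU line assumes the cudf-polars backend is installed:

```python
import polars as pl

# Lazy query: predicate and projection pushdown happen at plan time,
# so only the needed columns and rows are read from the Parquet file.
lf = (
    pl.scan_parquet("trades.parquet")  # hypothetical dataset
    .filter(pl.col("price") > 100)
    .group_by("symbol")
    .agg(pl.col("volume").sum().alias("total_volume"))
)

df = lf.collect()                  # CPU engine
# df = lf.collect(engine="gpu")    # RAPIDS cuDF engine, if installed
print(df)
```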

How It Was Built

Each skill went through a deep audit cycle. Extract every page from the official documentation. Read what exists in the skill files. Run a gap analysis. Update or create files.

Why This Format

Skills are just markdown files. No infrastructure. No vector database. No embedding pipeline. An agent loads the file, reads it, and has the knowledge. If PhysicsNeMo ships a new release with breaking changes, you update the markdown. The "retraining" cost is editing a text file.

The constraint that matters is the 500-line limit on SKILL.md. It forces you to be opinionated. You can't dump every API — you have to decide what's important, what's the right default, what the agent should reach for first. The references/ directory is where depth lives, but the skill file is where judgment lives.

The format is fully compatible with Claude Code Skills, OpenAI Codex Skills, and the agentskills.io specification. Same directory structure, same markdown files. Drop a skill into any of these systems and it works. No adapter layer, no conversion — the skill catalog is agent-runtime agnostic by design.

This is the same pattern from my earlier post on inference-time skill acquisition. The difference is scope. That post was about a single skill built in a single session. This is a catalog that I plan to grow and maintain continuously.

Where This Is Going

The catalog has grown from the initial 4 NVIDIA skills to 16 comprehensive skills covering quantum computing, physics simulation, computational biology, GPU programming, LLM inference, HPC schedulers, distributed computing, and data processing frameworks.

The goal remains: cover every major accelerated compute workload — mature and experimental. Physics engines, quantum compilers, molecular dynamics simulators, climate models, genomics pipelines, HPC job schedulers, distributed training frameworks, GPU kernel libraries. If it runs on GPUs, TPUs, or QPUs and has a nontrivial API surface, it's a candidate.

Each skill gets the same treatment: deep audit against official docs, structured into the SKILL.md + references format, kept current as releases ship. The catalog grows incrementally — new skills added, existing ones updated when frameworks evolve.

Recent additions show the breadth: Triton for writing custom GPU kernels with block-based programming (10 reference files covering Flash Attention, persistent kernels, FP8 quantization, debugging with TRITON_INTERPRET, CUDA/HIP/CPU backends). FlashInfer for production LLM serving (6 reference files on paged KV-cache, cascade inference achieving 31x speedup, FP4/FP8 GEMM, sorting-free sampling, vLLM/SGLang integration). Qiskit for IBM Quantum with runtime primitives and hardware design. Materials Project for DFT-computed crystal properties. Polars for GPU-accelerated DataFrames.

The format isn't limited to NVIDIA. The catalog includes quantum frameworks (CUDA Quantum, Qiskit), open-source HPC tools (SLURM, MPI, Dask, Ray), cheminformatics (RDKit), materials science (Materials Project), and cross-platform GPU programming (Triton supports CUDA/HIP/CPU).

Any major framework, SDK, engine, or simulator with accelerated compute backends fits the same structure. The skill just needs to encode the operational knowledge that makes an agent competent: what the APIs are, when to use what, and what breaks if you get it wrong. The long-term aim is a comprehensive reference that lets any agent — mine or yours — walk into a specialized HPC domain and be productive immediately.

The repo is open source: github.com/dtunai/agent-skills-for-compute.

16 skills and 110+ reference files today, covering quantum circuits, molecular dynamics, GPU kernel programming, LLM inference optimization, HPC job scheduling, distributed training, and scientific data processing. More coming.