
PyTurboQuant at a 100M Scale with Local Knowledge Base Retrieval for My Autonomous Agent System


If you're interested, let's have a chat; DMs are welcome.

@dtthinky · Working on AI toward flourishing in a world with MAS 100Ts+, together.


Open North Star is the codename for my autonomous agent system. I wanted one retrieval substrate to power two related jobs there:

  1. short-term and long-term agent memory
  2. document retrieval over a growing knowledge base

Why I moved away from Qdrant

Qdrant is a strong vector database. The issue was not capability; it was shape.

Open North Star needs retrieval to behave like part of the agent loop, not as a separate system boundary. I wanted memory and knowledge-base retrieval to share the same substrate so new files can be indexed once, changed files can be refreshed incrementally, and the whole stack can keep moving toward the 100M-vector target without turning retrieval into a rebuild problem.

Qdrant still asks me to think in terms of a dedicated retrieval layer. PyTurboQuant fits the rest of Open North Star more cleanly because it turns retrieval into a compact primitive instead of a service I have to reason around.

The core idea is simple: vectors are quantized independently with random rotations and precomputed Lloyd-Max codebooks, so there is no codebook training, no k-means, and no data pass over the corpus before indexing starts.
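
That training-free idea can be sketched in plain Python. Everything below is illustrative, not PyTurboQuant's actual implementation: the four quantization levels are the classic precomputed Lloyd-Max values for unit-variance Gaussian data, and the seeded sign-flip-plus-permutation transform is a cheap orthogonal stand-in for a proper random rotation.

```python
import math
import random

# Precomputed 2-bit Lloyd-Max codebook for unit-variance Gaussian data
# (the classic optimal 4-level quantizer; no training pass over the corpus).
CODEBOOK = [-1.510, -0.4528, 0.4528, 1.510]

def random_rotation(vec, seed=42):
    """Cheap orthogonal transform: seeded sign flips plus a permutation.
    A stand-in for the random rotation step; it spreads energy across
    coordinates without any data-dependent training."""
    rng = random.Random(seed)
    signs = [rng.choice((-1.0, 1.0)) for _ in vec]
    perm = list(range(len(vec)))
    rng.shuffle(perm)
    return [vec[perm[i]] * signs[i] for i in range(len(vec))]

def quantize(vec):
    """Rotate, normalize to roughly unit variance, then snap each
    coordinate to the nearest codebook entry. Each vector is handled
    independently, so indexing can start immediately."""
    rotated = random_rotation(vec)
    scale = math.sqrt(sum(x * x for x in rotated) / len(rotated)) or 1.0
    codes = [min(range(len(CODEBOOK)),
                 key=lambda k: abs(rotated[i] / scale - CODEBOOK[k]))
             for i in range(len(rotated))]
    return codes, scale

def dequantize(codes, scale):
    """Approximate reconstruction in the rotated space."""
    return [CODEBOOK[c] * scale for c in codes]

codes, scale = quantize([0.2, -1.3, 0.7, 0.05, 2.1, -0.4, 0.9, -0.8])
print(codes)  # one 2-bit code per dimension
```

Because the codebook is fixed and each vector is quantized on its own, there is nothing to retrain when new documents arrive, which is what makes incremental indexing cheap.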

That difference matters:

  • no extra daemon to keep alive
  • no separate operational layer for a small-to-medium knowledge system
  • one store that can serve memory and document retrieval
  • incremental updates without rebuilds or reindexing the whole corpus
  • a retrieval path that stays usable even as the knowledge base grows

In other words, I moved from "run a vector database" to "treat embeddings as a compact index."

How I wired it in

The stack is simple on purpose:

  • OpenAI generates the embeddings
  • PyTurboQuant stores them through its own vector store API
  • markdown files are chunked, fingerprinted, and indexed incrementally
  • memory entries and document chunks share the same retrieval substrate
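
The data flow through that stack can be sketched with stubs. The embedder and store below are stand-ins I wrote for illustration, not the OpenAI client or PyTurboQuant's real vector store API; only the shape of the pipeline (chunk, embed, upsert, search) reflects the setup described above.

```python
import hashlib

def embed(text):
    """Stub embedder: a deterministic pseudo-vector derived from a hash,
    standing in for the OpenAI embeddings call."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255.0 for b in digest[:8]]

def chunk_markdown(text, max_chars=400):
    """Naive chunker: split on blank lines, then cap chunk size."""
    chunks = []
    for para in text.split("\n\n"):
        para = para.strip()
        while para:
            chunks.append(para[:max_chars])
            para = para[max_chars:]
    return chunks

class VectorStore:
    """Minimal in-memory stand-in for the vector store (the real API
    may differ; this only shows the data flow)."""
    def __init__(self):
        self.rows = {}  # chunk_id -> (vector, text)

    def upsert(self, chunk_id, vector, text):
        self.rows[chunk_id] = (vector, text)

    def search(self, query_vec, top_k=3):
        def score(vec):
            return sum(a * b for a, b in zip(query_vec, vec))
        ranked = sorted(self.rows.items(),
                        key=lambda kv: score(kv[1][0]), reverse=True)
        return [(cid, text) for cid, (vec, text) in ranked[:top_k]]

store = VectorStore()
doc = "# Notes\n\nPyTurboQuant stores vectors.\n\nQdrant is a service."
for i, chunk in enumerate(chunk_markdown(doc)):
    store.upsert(f"doc:{i}", embed(chunk), chunk)
```

Memory entries and document chunks go through the same `upsert` path, which is what lets one store serve both jobs.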

The install is a single line:

```bash
uv pip install pyturboquant
```

For the knowledge base, I point the indexer at my vault and let it walk the tree:

```bash
python3 autoresearch/scripts/memory/index_markdown.py --project KB --prune-missing
```

That command does three useful things at once:

  • indexes new markdown files
  • skips files whose fingerprints have not changed
  • removes stale chunks for files that were deleted

That last part is what keeps the whole system live instead of brittle. I can keep writing notes, adding long documents, and refining old pages without worrying that every run will duplicate work.

How I introduced it to the agent

I didn't introduce PyTurboQuant as a library. I introduced it as a capability.

The agent now has three simple behaviors:

  • search before answering when the question should be grounded in prior work
  • write back memory when something is worth keeping
  • reindex only changed files when the knowledge base evolves

That keeps the prompt clean. The agent does not need to know about rotations, quantization tables, or manifest files. It just sees a reliable retrieval layer behind a few commands.
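
Those three behaviors amount to a small agent-facing surface. The wrapper below is hypothetical (the class name and substring matching are mine, and real lookups would go through vector search), but it shows how little the agent needs to see.

```python
class RetrievalCapability:
    """Hypothetical agent-facing wrapper: three verbs, no rotations,
    quantization tables, or manifest files in sight."""

    def __init__(self):
        self.memory = []   # compact facts written back by the agent
        self.kb = {}       # path -> document text

    def search(self, query, top_k=3):
        # Search before answering; substring match stands in for vector search.
        pool = self.memory + list(self.kb.values())
        return [t for t in pool if query.lower() in t.lower()][:top_k]

    def remember(self, fact):
        # Write back memory when something is worth keeping.
        self.memory.append(fact)

    def reindex(self, path, text):
        # Refresh only the changed file.
        self.kb[path] = text
```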

For the agent, the new mental model is:

  • "this project has memory"
  • "this project has a knowledge base"
  • "both are queryable with the same store"

That is exactly the kind of abstraction I wanted. The infrastructure stays hidden; the behavior becomes usable.

The first retrieval pass

To make sure the system handled meaningful queries, I started with a realistic test set: 15 queries over more than 100 long-form knowledge-base pages.

The questions were intentionally varied. Some were about implementation details. Some asked for cross-note synthesis. Some were looking for old decisions buried in longer writeups.

The result was strong enough to trust as a working retrieval layer, not just a demo.

| Metric | Result |
| --- | --- |
| Knowledge base pages indexed | 100+ |
| Query set | 15 |
| Top-1 retrieval | 13/15 |
| Recall@3 | 15/15 |
| Mean latency | ~0.8 s |
| Incremental skip rate on unchanged files | 100% |

That last line is the quiet killer feature. Once a page is indexed, it stays out of the way until it changes.
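
For anyone reproducing this kind of evaluation, the two retrieval metrics are straightforward to compute from ranked result lists. The toy data below is illustrative, not my actual 15-query set.

```python
def retrieval_metrics(results, expected):
    """Compute top-1 accuracy and recall@3 from ranked result lists.
    `results[q]` is the ranked list of page ids returned for query q;
    `expected[q]` is the page that should be retrieved."""
    top1 = sum(1 for q, ranked in results.items()
               if ranked[:1] == [expected[q]])
    recall3 = sum(1 for q, ranked in results.items()
                  if expected[q] in ranked[:3])
    return top1, recall3, len(results)

# Illustrative toy data, not the actual evaluation set.
results = {
    "q1": ["page_a", "page_b", "page_c"],
    "q2": ["page_x", "page_y", "page_z"],
    "q3": ["page_m", "page_k", "page_n"],
}
expected = {"q1": "page_a", "q2": "page_y", "q3": "page_k"}
top1, recall3, n = retrieval_metrics(results, expected)
print(f"top-1 {top1}/{n}, recall@3 {recall3}/{n}")  # top-1 1/3, recall@3 3/3
```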

The KB becomes a living system

The part I like most about this setup is that it turns the knowledge base into something operational, not archival.

When I add a new markdown file, it becomes queryable. When I edit an old one, only that file gets refreshed. When I delete a file, the index stops pretending it still exists.

That means the retrieval layer keeps pace with the work instead of lagging behind it.

It also lets me use the same system for two different levels of context:

  • memory for compact, reusable facts and decisions
  • document retrieval for longer notes and project knowledge

The agent doesn't care which layer the answer came from. It just gets a better-grounded answer.

What this changed for me

The biggest shift is that the agent now has a memory posture by default.

Before, retrieval felt like an external dependency. Now it is part of the workspace itself. The index lives next to the project. The retrieval path is predictable. The knowledge base can grow without turning into a rebuild problem.

That makes the whole system easier to scale in the way I actually care about: not "how many nodes can it serve," but "how much useful context can I keep close to the agent without slowing down iteration."

PyTurboQuant is a better fit for that than a traditional vector database.

The short-term target is to push this architecture to 100M vectors by splitting the corpus into sharded stores, keeping each shard incremental, and routing queries before retrieval.
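
A minimal sketch of that sharding plan, assuming hash-based document placement and centroid-based query routing (the centroids are my assumption for the routing step; the mechanism isn't pinned down above):

```python
import hashlib

def shard_for(doc_id, num_shards):
    """Stable hash routing: each document lives in exactly one shard,
    so every shard stays independently incremental."""
    h = int(hashlib.sha256(doc_id.encode()).hexdigest(), 16)
    return h % num_shards

def route_query(query_vec, shard_centroids, fanout=2):
    """Pre-retrieval routing sketch: score the query against one
    representative centroid per shard and search only the closest few,
    instead of fanning out to all of them."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    ranked = sorted(range(len(shard_centroids)),
                    key=lambda s: dot(query_vec, shard_centroids[s]),
                    reverse=True)
    return ranked[:fanout]
```

At 100M vectors, routing before retrieval is what keeps per-query cost bounded by shard size rather than corpus size.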

If the system keeps growing, I can layer more structure on top later. But for an agentic workflow, this is the right foundation: compact, incremental, and built to scale.