Speed Is the Wrong Axis: Benchmarking Draft's Graph Engine Against grep

The Bottom Line

Every AI coding agent needs to answer structural questions about your code: what calls this function, what does this change touch, where are the hotspots, are there dependency cycles. The default tool for that is text search — grep, ripgrep, the IDE's "find usages." Draft instead queries a local knowledge graph of your repo, served by codebase-memory-mcp.

So we measured the difference. Same questions, two interfaces, two real codebases, median of 5 timed runs. Three findings:

grep usually wins raw latency. On trivial literal lookups it is milliseconds; the engine's per-call CLI startup is tens to hundreds of ms. This is the axis people fixate on, and it is the one that matters least — both are a rounding error next to a 3–30 second model turn.
The engine wins accuracy decisively. On "rank the top-10 hottest functions," grep scored 10–50% precision and recall (it ranks variable names and keywords as if they were functions); the engine scored 100% / 100%. grep cannot express dependency cycles or transitive blast radius at all.
The engine wins tokens and cost by 5.8×–108×. To reach the correct answer, grep forces the agent to read the files its noisy matches point into. On the large repo, disambiguating the callers of one function cost grep 289,000 tokens; the engine answered in 2,669. That is the number that actually shows up on your bill.

And the gap grows with codebase size: grep gets worse precisely as the repo gets bigger, which is the opposite of what you want from a search tool.

What We Measured

Two corpora, chosen to bracket the range most teams live in:

Corpus A — a ~17K-LOC Rust service (47 source files). The engine indexed it to ~3,100 nodes and ~7,700 edges.
Corpus B — a ~221K-LOC polyglot platform (Python-dominant, with TypeScript and Rust; ~1,200 Python files in total). The engine indexed it to ~32,500 nodes and ~191,000 edges.

The engine is codebase-memory-mcp — a single local binary (tree-sitter + LSP, 159 languages, 100% local, no API key) that indexes a repo into a SQLite knowledge graph and answers structural queries on demand. Draft drives it through the scripts/tools/graph-*.sh wrappers (graph-arch, hotspot-rank, graph-callers, cycle-detect, graph-impact). The baseline is GNU grep + find — the fairest "no special tooling" floor. We measured wall-clock latency, answer accuracy against hand-verified ground truth, and the tokens an agent must ingest to reach the correct answer (counted with tiktoken o200k_base).
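For readers who want to reproduce the two raw measurements, here is a minimal sketch of "median of 5 timed runs" and "tokens counted with tiktoken o200k_base". It is not the harness used for this benchmark; the function names `median_latency_ms` and `count_tokens` are illustrative assumptions, and the command you time is whatever wrapper or grep invocation you choose.

```python
import statistics
import subprocess
import time


def median_latency_ms(cmd, runs=5):
    """Run `cmd` (an argv list) `runs` times; return median wall-clock ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output is captured and discarded; only elapsed time is recorded.
        subprocess.run(cmd, capture_output=True, check=False)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def count_tokens(text, encoding_name="o200k_base"):
    """Count tokens with tiktoken (third-party: pip install tiktoken)."""
    import tiktoken  # imported lazily so the timing helper works without it

    encoding = tiktoken.get_encoding(encoding_name)
    # disallowed_special=() treats special-token strings in source as plain text.
    return len(encoding.encode(text, disallowed_special=()))
```

A call such as `median_latency_ms(["grep", "-rwn", "submit_order", "."])` times the baseline; passing the captured output of either path to `count_tokens` gives the token figure. Timing a subprocess includes process startup, which is why the engine's per-call numbers are an upper bound.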

Latency: grep Often Wins, and It Doesn't Matter

Five canonical tasks, engine vs grep, on the large polyglot repo:

Latency per task, large polyglot repo

Wall-clock, median of 5 runs (lower is better). grep wins four of five, and both sit far under a 3–30 s model turn. Accuracy winner noted per row.

Task	Engine	grep	Result
Module inventory	170 ms	1 ms	grep 170× faster; engine adds edges
Hotspot top-10 (fan-in)	1050 ms	30 ms	grep 35× faster; engine accurate
Callers of a function	310 ms	20 ms	grep 16× faster; engine accurate
Dependency cycles	1020 ms	n/a	engine only
Impact / blast radius	40 ms	20 ms	grep 2× faster; engine only

grep is faster on four of five tasks, sometimes by a lot. We are not going to pretend otherwise. But both interfaces are sub-second, and a single agent turn — one model inference — takes 3 to 30 seconds. Shaving 250 ms off a tool call you make a handful of times per turn is invisible. The engine's latency is also an upper bound: each call pays CLI startup plus a stateless query against the on-disk graph. A persistent MCP server amortizes that away. Latency is the axis that's easy to measure and irrelevant to optimize. The decisive axes are the next two.

Accuracy: grep Hands Your Agent Noise

This is where text search quietly fails. grep matches characters; it has no concept of a function, a call edge, or a module boundary. On real code, that produces both false positives (it matches things that aren't what you asked for) and false negatives (it misses things that are).

Hotspot top-10: precision and recall, large repo

Higher is better. grep ranks variable names and keywords (is, in, config, len, status…) as if they were functions, so it scores near-zero on both.

Metric	Graph engine	grep baseline	Gap
Precision	100% (10/10 real)	10% (9/10 noise)	10×
Recall	100%	12% (missed nearly all)	8×
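To make the scoring concrete, here is a small sketch of how precision and recall can be computed for a top-10 ranking against a hand-verified list. The function names in `truth` (other than submit_order) and the extra noise tokens are hypothetical, so the output illustrates the method and does not reproduce the measured 10% / 12%.

```python
def precision_recall(predicted, ground_truth):
    """Return (precision, recall) of a predicted set against ground truth."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    hits = predicted & ground_truth
    # Precision: share of what the tool returned that is actually correct.
    precision = len(hits) / len(predicted) if predicted else 0.0
    # Recall: share of the correct answers that the tool returned.
    recall = len(hits) / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Hypothetical hand-verified top-10 functions by fan-in.
truth = [
    "submit_order", "load_config", "get_session", "validate_payload",
    "emit_event", "parse_request", "open_connection", "retry_call",
    "format_error", "record_metric",
]

# A frequency-based ranking surfaces keywords and variable names instead.
grep_top10 = [
    "is", "in", "config", "len", "status",
    "self", "return", "name", "data", "submit_order",
]

grep_score = precision_recall(grep_top10, truth)
engine_score = precision_recall(truth, truth)
```

With one real function among ten results, `grep_score` is (0.1, 0.1); a ranking that matches the verified list scores (1.0, 1.0).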

For the structural queries, grep's output cannot even be scored as a percentage:

Callers of a function — engine: 80 precise call edges, test/prod-classified. grep: 202 word-matches (180 in tests, 8 definitions, 22 unrelated).
Dependency cycles — engine: SCCs computed. grep: 0, not expressible in text search (∞ gap).
Blast radius of a change — engine: 44-node transitive closure. grep: 490 flat hits, no transitivity, no edges.
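The difference between "flat hits" and a "transitive closure" is easy to see in code. The sketch below walks caller edges backwards from a changed function with a breadth-first search. It illustrates the idea only and is not the engine's implementation; the edge list and every function name except submit_order are hypothetical.

```python
from collections import defaultdict, deque


def blast_radius(call_edges, changed):
    """Return every function that directly or transitively calls `changed`.

    `call_edges` is an iterable of (caller, callee) pairs.
    """
    callers_of = defaultdict(set)
    for caller, callee in call_edges:
        callers_of[callee].add(caller)

    affected = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for caller in callers_of[node]:
            # Skip nodes already visited and the changed function itself.
            if caller not in affected and caller != changed:
                affected.add(caller)
                queue.append(caller)
    return affected


# Hypothetical call graph.
edges = [
    ("order_service", "submit_order"),
    ("test_submit", "submit_order"),
    ("http_handler", "order_service"),
    ("cli_main", "http_handler"),
    ("report_job", "format_report"),  # unrelated to the change
]

affected = blast_radius(edges, "submit_order")
```

Here `affected` contains order_service, test_submit, http_handler, and cli_main, but not report_job. Text search has no edges to walk, so it can only return lines that mention the name.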

The recall trap is the killer. Ask grep for the callers of submit_order and grep -wn submit_order returns 202 lines — mostly tests and definitions — and zero of the three real production call sites. Why? Because in production the function is invoked as a method: self._order_manager.submit_order(...). A search for submit_order( never sees .submit_order(. The agent now has to refine its regex and read source to recover the calls grep missed, all while wading through 180 test matches it didn't want. The engine returns all 80 call edges, each attributed to an exact file:function and classified test-vs-prod, in one call. One is a confident answer; the other is a research project.

Tokens and Cost: The Bill You Actually Pay

Accuracy and cost are the same story told twice. To turn grep's noisy output into a correct answer, the agent has to read the files those matches point into — to tell a definition from a test from a string from a qualified call. The engine's structured output needs no follow-up reads. Here is what each path costs in tokens to reach the correct answer, on both repos:

Tokens to reach the correct answer

Lower is better. The raw values span ~770×; the ratio on each row is the real linear multiple. Counted with tiktoken o200k_base.

Task	Engine	grep	Ratio
Callers of one function, small Rust repo	2,424	66,534	27×
Callers of one function, large polyglot repo	2,669	289,096	108×
Full architecture pass, small Rust repo	24,218	141,297	5.8×
Full architecture pass, large polyglot repo	48,195	1,859,804	39×

Notice the direction: on the small repo the caller query is 27× cheaper through the engine; on the large repo it's 108×. grep's cost scales with how much noise it produces, which scales with the size of the codebase. The engine's output stays bounded. The bigger your repo, the more the engine saves — the opposite of how text search degrades.
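The multiples quoted here are plain division of the measured token counts. A short sketch that recomputes them from the figures reported above (the dictionary keys are just labels chosen for this example):

```python
# (engine_tokens, grep_tokens) as reported in this article.
MEASUREMENTS = {
    "callers, small Rust repo": (2_424, 66_534),
    "callers, large polyglot repo": (2_669, 289_096),
    "architecture pass, small Rust repo": (24_218, 141_297),
    "architecture pass, large polyglot repo": (48_195, 1_859_804),
}


def token_ratios(measurements):
    """Return grep tokens divided by engine tokens for each task."""
    return {
        task: grep_tokens / engine_tokens
        for task, (engine_tokens, grep_tokens) in measurements.items()
    }


ratios = token_ratios(MEASUREMENTS)
for task, ratio in ratios.items():
    print(f"{task}: {ratio:.1f}x")
```

The caller query goes from roughly 27× on the small repo to roughly 108× on the large one, while the engine's own token count barely moves (2,424 vs 2,669).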

Translate tokens to dollars at current Claude input prices (Opus 4.8 $5/MTok, Sonnet 4.6 $3/MTok, Haiku 4.5 $1/MTok) and the per-query gap for the large-repo caller query looks modest: about a cent or less through the engine vs roughly $0.29 to $1.45 through grep, depending on the model. It stops looking modest at fleet scale. Here is one illustrative workload, a team running 1,000 caller-disambiguation queries a day, on the large repo:

Annual cost: 1,000 caller queries/day, large repo

Lower is better. Illustrative volume; the ratio is the property that holds.

Model	Engine $/yr	grep $/yr	Avoidable $/yr
Opus 4.8	$4.9k	$528k	$523k
Sonnet 4.6	$2.9k	$317k	$314k
Haiku 4.5	$1.0k	$106k	$105k

The exact dollars depend on your volume and model mix — treat them as illustrative. The ratio doesn't: it's a property of how much each path makes the agent read, and it's tokenizer-independent.
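To plug in your own volume, the annual figures can be approximated as tokens per query × queries per day × days × price. The sketch below assumes 365 days a year and counts input tokens only; both are assumptions of this example, though they reproduce the rounded figures above.

```python
# Input prices in USD per million tokens, as quoted in this article.
PRICE_PER_MTOK = {"Opus 4.8": 5.0, "Sonnet 4.6": 3.0, "Haiku 4.5": 1.0}

ENGINE_TOKENS = 2_669   # large-repo caller query via the engine
GREP_TOKENS = 289_096   # same query via grep-to-correct


def annual_cost_usd(tokens_per_query, queries_per_day, price_per_mtok, days=365):
    """Yearly input-token spend in USD for a fixed daily query volume."""
    total_tokens = tokens_per_query * queries_per_day * days
    return total_tokens * price_per_mtok / 1_000_000


for model, price in PRICE_PER_MTOK.items():
    engine = annual_cost_usd(ENGINE_TOKENS, 1_000, price)
    grep = annual_cost_usd(GREP_TOKENS, 1_000, price)
    print(f"{model}: engine ${engine:,.0f}, grep ${grep:,.0f}, "
          f"avoidable ${grep - engine:,.0f}")
```

Changing `queries_per_day` scales both columns equally, which is why the ratio survives any choice of volume.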

The Scaling Wall: When grep Stops Fitting at All

There's a harder limit than cost. An agent has a finite context window. The grep-to-correct path and the read-the-whole-repo path both blow past it on a large codebase — the engine's structured output does not:

Path (large repo)	Tokens	Fits 200K window?	Fits 1M window?
grep-to-correct, one caller query	289K	no	yes
Read all source for an architecture pass	1.86M	no	no
Engine architecture pass (all 5 queries)	48K	yes	yes, bounded

This is the part that isn't a tradeoff. On a 221K-LOC repo, "just read the code" is not an option at any price — it doesn't fit. The engine is the only path that scales to a codebase an agent can't hold in its head. Bounded, structured output is what lets an agent reason about a repo far larger than its context window.
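The fit check is a simple comparison, sketched here with the token counts reported in this article. The `reserved` parameter is an assumption of this example: a real agent also needs room for its system prompt, conversation history, and output, so the usable window is smaller than the nominal one.

```python
WINDOWS = {"200K": 200_000, "1M": 1_000_000}

# Token counts for the large repo, as reported in this article.
PATHS = {
    "grep-to-correct, one caller query": 289_096,
    "read all source for an architecture pass": 1_859_804,
    "engine architecture pass (all 5 queries)": 48_195,
}


def fits(tokens, window, reserved=0):
    """True if `tokens` plus a reserved budget fit inside `window`."""
    return tokens + reserved <= window


def fit_table(paths, windows, reserved=0):
    """Map each path to {window label: fits?}."""
    return {
        name: {label: fits(tokens, size, reserved)
               for label, size in windows.items()}
        for name, tokens in paths.items()
    }


table = fit_table(PATHS, WINDOWS)
```

With `reserved=0` this reproduces the table above. Raising `reserved` only makes the grep and read-everything paths fail sooner.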

The Honest Caveats

A benchmark that only flatters its subject isn't worth much. Three things to keep in mind:

grep genuinely wins raw latency on trivial literal lookups, and the engine's per-call timing is an upper bound (CLI startup + stateless query). A persistent server closes that gap; none of it matters next to a model turn. If all you need is "does this string appear," grep is the right tool — reach for the graph when the question is structural.
Cycle detection over-reports. On the large repo the engine flagged 100 cycles, but most were short self-referential artifacts (route registrations, formatter/util trios) rather than true module-level import cycles. The headline isn't the count — it's the capability: grep produces zero, because cycles aren't expressible in text search.
Static analysis has blind spots. The engine's call edges are precise where they're statically resolvable and can be silently incomplete or over-resolved at dynamic-dispatch boundaries — dependency injection, closures, interface/adapter indirection (e.g. a call routed through a port trait). The right workflow is graph-first to navigate, then confirm the dynamically-wired spine at the source. The engine narrows the search from 490 candidates to a 44-node closure; you verify the handful of dynamic edges by hand instead of reading everything.
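On the cycle-detection caveat: dependency cycles correspond to strongly connected components (SCCs) of the graph, and one plausible way to cut down on short artifacts is to drop components below a size threshold. The sketch below uses Tarjan's algorithm as a generic illustration; it is not the engine's code, and the `min_size` filter and the example graph are hypothetical.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm. `graph` maps node -> iterable of successors."""
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    components = []
    counter = 0

    def visit(node):
        nonlocal counter
        index[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for succ in graph.get(node, ()):
            if succ not in index:
                visit(succ)
                lowlink[node] = min(lowlink[node], lowlink[succ])
            elif succ in on_stack:
                lowlink[node] = min(lowlink[node], index[succ])
        # `node` is the root of a component: pop the component off the stack.
        if lowlink[node] == index[node]:
            component = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.add(member)
                if member == node:
                    break
            components.append(component)

    for node in list(graph):
        if node not in index:
            visit(node)
    return components


def cycles_at_least(graph, min_size=2):
    """Keep only components with at least `min_size` members."""
    return [c for c in strongly_connected_components(graph)
            if len(c) >= min_size]


# Hypothetical module graph: a self-loop, a pair, and a three-module cycle.
modules = {
    "routes": ["routes"],
    "fmt": ["util"], "util": ["fmt"],
    "orders": ["billing"], "billing": ["ledger"], "ledger": ["orders"],
    "main": ["orders", "routes"],
}

pairs_and_up = cycles_at_least(modules, min_size=2)
larger_only = cycles_at_least(modules, min_size=3)
```

The recursive form is fine for an illustration; a graph with long call chains would need an iterative version to stay under Python's recursion limit. Any threshold is a judgment call, since a small cycle can still be a real one.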

One more on the numbers: tiktoken runs ~15–20% below Claude's own tokenizer, so the absolute token and dollar figures are a conservative floor; the real Claude counts are higher. The cross-path ratios are effectively tokenizer-independent, because a different tokenizer scales both paths by about the same factor.

What This Means for Using Draft

You don't operate any of this directly. Run /draft:init once and Draft builds the graph; from then on every command — /draft:review, /draft:bughunt, /draft:decompose, /draft:impact — queries it live instead of grepping blind. When the agent needs to know what a change touches, it gets a transitive closure, not 490 flat hits. When it ranks hotspots, it ranks real functions by fan-in, not keywords by frequency. And it spends a few thousand tokens doing it instead of a few hundred thousand.

The engine runs 100% locally — no API key, no SaaS, no source code leaving your machine. The graph is queried live, so it never goes stale between commits (the only committed artifact is a tiny gate marker at draft/graph/schema.yaml). It's free and open source.

grep is a fine tool for finding a string. It is a poor substitute for understanding a codebase — and on a large repo, an expensive one. The whole point of giving an AI agent structural memory is to stop making it read the haystack to find the needle.