Accord Book Retrieval Benchmarks: What the April 2026 Run Showed

Accord Book's April 2026 LongMemEval-style evaluation cleared every production-fit threshold: 990 ms retrieval p95, 90% judged accuracy, and 0 API errors, validated end-to-end through the full pipeline.

Accord Book is built around a core promise: when a team asks for project context, the system returns the decisions, constraints, and history that still matter now.

Benchmarking retrieval is how we hold that promise to account.

Here is what the April 2026 evaluation showed.

The benchmark run

[Figure: Retrieval evolution timeline, showing why Accord Book now describes the default retrieval path as vector, lexical, and supersedes-aware reranking.]

The April 22, 2026 LongMemEval-style production-fit analysis ran against a 20-entry benchmark subset. Every threshold passed:

Metric	Baseline	Treatment	Threshold	Result
Retrieval p95	1431.4 ms	990.1 ms	< 1500 ms	Pass
POST-curation `llm_correct`	90.0%	90.0%	>= 70%	Pass
Extraction p95 / max	517.0 s / 536 s	—	< 600 s	Pass
Worst curation stage max	76 s	—	< 180 s	Pass
API 4xx/5xx	0	0	0	Pass

These numbers reflect an end-to-end run through ingestion, extraction, curation, retrieval, and LLM-based judging — not an isolated search test.
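As an illustration of how a pass/fail gate like this can be expressed, here is a minimal sketch that checks the treatment-path figures from the table against their thresholds. The metric keys and the `evaluate` helper are hypothetical names chosen for this example and are not Accord Book's actual tooling; only the numbers come from the table.

```python
import operator

# Observed value, comparison, and limit for each metric in the table.
# Keys are illustrative names, not identifiers from the real benchmark harness.
RESULTS = {
    "retrieval_p95_ms": (990.1, operator.lt, 1500.0),
    "post_curation_llm_correct_pct": (90.0, operator.ge, 70.0),
    "extraction_p95_s": (517.0, operator.lt, 600.0),
    "extraction_max_s": (536.0, operator.lt, 600.0),
    "worst_curation_stage_max_s": (76.0, operator.lt, 180.0),
    "api_4xx_5xx_count": (0, operator.eq, 0),
}


def evaluate(results):
    """Return {metric: bool}, True when the observed value satisfies its threshold."""
    return {
        name: compare(value, limit)
        for name, (value, compare, limit) in results.items()
    }


verdicts = evaluate(RESULTS)
all_pass = all(verdicts.values())

for name, passed in verdicts.items():
    print(f"{name}: {'Pass' if passed else 'Fail'}")
```

Keeping the comparison operator next to each limit makes the direction of every threshold explicit, which matters when some metrics must stay below a ceiling and others above a floor.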

Three things this confirmed

1. The pipeline works end-to-end

Accord Book's retrieval was benchmarked through the full stack: ingestion, extraction, curation, retrieval, and judging. That is a stronger signal than a standalone vector similarity test and validates the architecture as a system, not just individual components.

2. Judge-based evaluation surfaces what string matching misses

The run recorded 80% on string match and 90% on judged correctness. In a project-memory system, an answer can be grounded and materially accurate without matching a stored fixture verbatim, so string matching alone undercounts correct answers. LLM-based judge evaluation is the right tool for that distinction.
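The gap between the two scores can be sketched with a toy example. The records below are invented for illustration and are not taken from the benchmark. The `judged_correct` field stands in for an LLM judge's verdict, and the normalization rule is an assumption, not the benchmark's actual matcher.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def string_match(answer: str, expected: str) -> bool:
    """True when the normalized expected string appears inside the normalized answer."""
    return normalize(expected) in normalize(answer)


# Toy records, invented for this sketch. `judged_correct` is a stand-in
# for the verdict an LLM judge would return.
records = [
    {"expected": "PostgreSQL 15", "answer": "The team chose PostgreSQL 15.", "judged_correct": True},
    {"expected": "weekly releases", "answer": "Releases ship once a week.", "judged_correct": True},
    {"expected": "EU region only", "answer": "Data can be stored anywhere.", "judged_correct": False},
]

string_rate = sum(string_match(r["answer"], r["expected"]) for r in records) / len(records)
judged_rate = sum(r["judged_correct"] for r in records) / len(records)

print(f"string match: {string_rate:.0%}, judged correct: {judged_rate:.0%}")
```

The second record is the interesting case: the answer paraphrases the fixture, so string matching rejects it while a judge would accept it.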

3. Supersedes-aware reranking improved latency without trading accuracy

The treatment path hit a retrieval p95 of 990.1 ms against 1431.4 ms for the baseline, a reduction of roughly 31%, while holding judged correctness at 90.0%. That supports continued investment in the retrieval stack as an engineered system, not just a vector index.
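The relative improvement follows directly from the two p95 figures in the results table. A quick check, with variable names chosen for this example:

```python
# Retrieval p95 values from the results table, in milliseconds.
baseline_p95_ms = 1431.4
treatment_p95_ms = 990.1

# Relative reduction: how much lower the treatment p95 is, as a share of the baseline.
saved_ms = baseline_p95_ms - treatment_p95_ms
reduction = saved_ms / baseline_p95_ms

print(f"saved {saved_ms:.1f} ms, reduction {reduction:.1%}")
# prints: saved 441.3 ms, reduction 30.8%
```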

What's next

The April run used a 20-entry subset. Upcoming work includes validating against a larger dataset, evaluating the graph-enabled retrieval path on entity-dense project scenarios, and expanding the benchmark suite as Accord Book moves toward broader pilot availability.

For the full retrieval architecture story, read Why Production Retrieval Needs More Than Vector Search. For the graph-positioning decision, read How We Decide When Graph Belongs on the Retrieval Path.