One of Q's jobs is to catch project drift before it turns into rework. A client asks for something that contradicts an earlier decision. A commit touches a frozen area. A new request quietly adds scope without removing anything else.
Those are not task-management problems. They are memory and change-control problems.
Our latest conflict-detector benchmark is strong enough to support an initial pilot, but the framing matters: the output should be treated as risk alerts for owner review, not as exhaustive automated prevention.
## The result
The latest full run used a synthetic conflict-eval dataset with 86 scenarios, including adversarial negatives. The aggregate result:
| Outcome | Count |
|---|---|
| True positives | 52 |
| False positives | 0 |
| True negatives | 30 |
| False negatives | 4 |
That gives 100% observed precision (52/52) and 92.86% recall (52/56) on this dataset.
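The aggregate numbers above follow directly from the confusion counts. As a minimal sketch (the function name is illustrative, not part of Q's actual evaluation harness):

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) from confusion counts.

    precision = TP / (TP + FP); recall = TP / (TP + FN).
    Guards against division by zero when a class is empty.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Counts from the table above: TP=52, FP=0, FN=4.
p, r = precision_recall(tp=52, fp=0, fn=4)
print(f"precision={p:.2%} recall={r:.2%}")  # precision=100.00% recall=92.86%
```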
## Where it performed well
The detector performed cleanly on the categories that matter most for early Q pilots:
| Conflict type | Recall | Observed precision |
|---|---|---|
| Decision drift | 100.00% | 100.00% |
| Factual contradiction | 100.00% | 100.00% |
| Stakeholder conflict | 100.00% | 100.00% |
| Temporal staleness | 100.00% | 100.00% |
| Constraint violation | 87.50% | 100.00% |
| Resource conflict | 87.50% | 100.00% |
| Scope creep | 75.00% | 100.00% |
The most important product signal is the lack of false positives in this run. A conflict system that cries wolf becomes operational noise. For an agency owner, a smaller number of high-confidence alerts is more useful than a noisy feed that needs constant triage.
## System shape
At a high level, Q separates conflict detection into stages:
- Turn project artifacts into structured, provenance-aware facts.
- Compare new facts against relevant prior facts and decisions.
- Persist durable findings so evaluation and delivery can be reasoned about separately.
- Surface only the useful subset as owner-facing risk alerts.
That separation is important. Detection quality should not be confused with notification policy. The system should preserve what it found, then make a separate product decision about which findings are worth surfacing to the owner.
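The separation between detection and notification policy can be sketched in a few lines. This is an illustrative shape only; the names (`Fact`, `Finding`, `detect`, `should_surface`) are hypothetical stand-ins, not Q's actual API, and the contradiction heuristic is a toy:

```python
from dataclasses import dataclass

@dataclass
class Fact:
    claim: str
    source: str  # provenance: which artifact this fact came from

@dataclass
class Finding:
    kind: str                 # e.g. "factual_contradiction", "scope_creep"
    confidence: float
    facts: tuple[Fact, Fact]  # the new fact and the prior fact it conflicts with

def detect(new: Fact, prior: list[Fact]) -> list[Finding]:
    """Stage 2: compare a new fact against prior facts.
    Toy heuristic: flag any prior claim this fact directly negates."""
    return [
        Finding("factual_contradiction", 0.9, (new, old))
        for old in prior
        if new.claim == f"not {old.claim}" or old.claim == f"not {new.claim}"
    ]

def should_surface(f: Finding, threshold: float = 0.8) -> bool:
    """Stage 4: notification policy, deliberately separate from detection.
    Every Finding is persisted; only high-confidence ones become alerts."""
    return f.confidence >= threshold

prior = [Fact("launch is March", "kickoff-notes")]
findings = detect(Fact("not launch is March", "client-email"), prior)
alerts = [f for f in findings if should_surface(f)]
```

The point of the sketch is the boundary: `detect` records everything it found, and `should_surface` makes the separate product decision about what the owner sees, so tightening one does not silently change the other.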
## Where it missed
There were four false negatives. The misses clustered around scope, resource, and constraint cases where the right pair of facts was not connected early enough in the pipeline.
That is the next improvement area: recover recall on category-vs-instance cases without giving back the precision gains. In practice, that means catching more subtle "this new request violates the spirit of an earlier constraint" cases while still avoiding noisy alerts.
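A category-vs-instance miss looks roughly like this: a prior constraint is stated at the category level while the new request names a specific instance, so naive equality matching never connects the pair. The sketch below is hypothetical (the lookup table and function are illustrative, not Q's implementation) and only shows why an extra linking step is needed:

```python
# Assumed instance -> category lookup; in practice this mapping would be
# inferred rather than hand-written.
CATEGORY_OF = {
    "tiktok ads": "paid channels",
    "google ads": "paid channels",
}

def violates_constraint(request: str, constraint_category: str) -> bool:
    """Link an instance-level request to a category-level constraint.

    Plain string equality ("TikTok ads" != "paid channels") would miss
    this pair; the category lookup is what connects the two facts.
    """
    return CATEGORY_OF.get(request.lower()) == constraint_category

print(violates_constraint("TikTok ads", "paid channels"))  # True
```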
## Product framing
This result supports an initial release as conservative risk-signal infrastructure:
- good enough to alert an owner when Q sees a likely conflict
- not framed as complete conflict prevention
- best used with human review and provenance links
- measured again against pilot data as soon as real usage exists
That is the right bar for this stage. The product should reliably surface risks that would otherwise stay hidden until they become expensive, while keeping the owner in control of the final decision.