One of Q's jobs is to catch project drift before it turns into rework. A client asks for something that contradicts an earlier decision. A commit touches a frozen area. A new request quietly adds scope without removing anything else.
Those are not task-management problems. They are memory and change-control problems.
Our latest conflict-detector benchmark is strong enough to support an initial pilot, but the framing matters: the output should be treated as risk alerts for owner review, not as exhaustive automated prevention.
## The result
The latest full run used a synthetic conflict-eval dataset with 86 scenarios, including adversarial negatives. The aggregate result:
| Outcome | Count |
|---|---|
| True positives | 52 |
| False positives | 0 |
| True negatives | 30 |
| False negatives | 4 |
That gives 100% observed precision (52/52) and 92.86% recall (52/56) on this dataset.
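The aggregate numbers above follow directly from the confusion counts. As a minimal sketch (the function name is illustrative, not part of Q's actual evaluation harness):

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) from confusion counts.

    precision = TP / (TP + FP); recall = TP / (TP + FN).
    Guards against division by zero when a class is empty.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Counts from the table above: TP=52, FP=0, FN=4.
p, r = precision_recall(tp=52, fp=0, fn=4)
print(f"precision={p:.2%} recall={r:.2%}")  # precision=100.00% recall=92.86%
```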
## Where it performed well
The detector performed cleanly on the categories that matter most for early Q pilots:
| Conflict type | Recall | Observed precision |
|---|---|---|
| Decision drift | 100.00% | 100.00% |
| Factual contradiction | 100.00% | 100.00% |
| Stakeholder conflict | 100.00% | 100.00% |
| Temporal staleness | 100.00% | 100.00% |
| Constraint violation | 87.50% | 100.00% |
| Resource conflict | 87.50% | 100.00% |
| Scope creep | 75.00% | 100.00% |
The most important product signal is the lack of false positives in this run. A conflict system that cries wolf becomes operational noise. For an agency owner, a smaller number of high-confidence alerts is more useful than a noisy feed that needs constant triage.
## System shape
At a high level, Q separates conflict detection into stages:
- Turn project artifacts into structured, provenance-aware facts.
- Compare new facts against relevant prior facts and decisions.
- Persist durable findings so evaluation and delivery can be reasoned about separately.
- Surface only the useful subset as owner-facing risk alerts.
That separation is important. Detection quality should not be confused with notification policy. The system should preserve what it found, then make a separate product decision about which findings are worth surfacing to the owner.
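The separation between detection and notification policy can be sketched in a few lines. This is an illustrative shape only; the names (`Fact`, `Finding`, `detect`, `should_surface`) are hypothetical stand-ins, not Q's actual API, and the contradiction heuristic is a toy:

```python
from dataclasses import dataclass

@dataclass
class Fact:
    claim: str
    source: str  # provenance: which artifact this fact came from

@dataclass
class Finding:
    kind: str                 # e.g. "factual_contradiction", "scope_creep"
    confidence: float
    facts: tuple[Fact, Fact]  # the new fact and the prior fact it conflicts with

def detect(new: Fact, prior: list[Fact]) -> list[Finding]:
    """Stage 2: compare a new fact against prior facts.
    Toy heuristic: flag any prior claim this fact directly negates."""
    return [
        Finding("factual_contradiction", 0.9, (new, old))
        for old in prior
        if new.claim == f"not {old.claim}" or old.claim == f"not {new.claim}"
    ]

def should_surface(f: Finding, threshold: float = 0.8) -> bool:
    """Stage 4: notification policy, deliberately separate from detection.
    Every Finding is persisted; only high-confidence ones become alerts."""
    return f.confidence >= threshold

prior = [Fact("launch is March", "kickoff-notes")]
findings = detect(Fact("not launch is March", "client-email"), prior)
alerts = [f for f in findings if should_surface(f)]
```

The point of the sketch is the boundary: `detect` records everything it found, and `should_surface` makes the separate product decision about what the owner sees, so tightening one does not silently change the other.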
## Where it missed
There were four false negatives. The misses clustered around scope, resource, and constraint cases where the right pair of facts was not connected early enough in the pipeline.
That is the next improvement area: recover recall on category-vs-instance cases without giving back the precision gains. In practice, that means catching more subtle "this new request violates the spirit of an earlier constraint" cases while still avoiding noisy alerts.
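A category-vs-instance miss looks roughly like this: a prior constraint is stated at the category level while the new request names a specific instance, so naive equality matching never connects the pair. The sketch below is hypothetical (the lookup table and function are illustrative, not Q's implementation) and only shows why an extra linking step is needed:

```python
# Assumed instance -> category lookup; in practice this mapping would be
# inferred rather than hand-written.
CATEGORY_OF = {
    "tiktok ads": "paid channels",
    "google ads": "paid channels",
}

def violates_constraint(request: str, constraint_category: str) -> bool:
    """Link an instance-level request to a category-level constraint.

    Plain string equality ("TikTok ads" != "paid channels") would miss
    this pair; the category lookup is what connects the two facts.
    """
    return CATEGORY_OF.get(request.lower()) == constraint_category

print(violates_constraint("TikTok ads", "paid channels"))  # True
```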
## Product framing
This result supports an initial release as conservative risk-signal infrastructure:
- good enough to alert an owner when Q sees a likely conflict
- not framed as complete conflict prevention
- best used with human review and provenance links
- measured again against pilot data as soon as real usage exists
That is the right bar for this stage. The product should reliably surface risks that would otherwise stay hidden until they become expensive, while keeping the owner in control of the final decision.