A 200-item long-context experiment showing answer accuracy surviving longer than witness fidelity.
This run tested Grok on synthetic long-document decision items from 4K to 512K context.
The main empirical result is clean: in the normal condition, answer accuracy declines modestly while witness fidelity falls much faster,
widening the mirage gap from +0.038 at 4K to +0.222 at 512K.
The witness-removed causal control falls well below chance (0.5 on a binary question), from 0.315 at 4K to
0.275 at 256K. The conservative interpretation is that nearby near-miss evidence can actively
pull the model toward the wrong answer once the true witness is missing.
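The "well below chance" reading can be sanity-checked with an exact one-sided binomial tail test. The sketch below is illustrative only: it assumes a hypothetical 200 control items at 256K (per-condition item counts are not stated here) and asks how likely a 0.275 accuracy would be if the model were guessing at 0.5.

```python
from math import comb

# Hypothetical count: assume 200 control items at 256K (the writeup does
# not state per-condition counts). Accuracy 0.275 -> 55 correct.
n = 200
k = round(0.275 * n)  # 55

# Exact one-sided binomial tail: P(X <= k) when answers are coin flips.
p_value = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
print(f"P(<= {k}/{n} correct under chance) = {p_value:.2e}")
```

Under these assumed counts the tail probability is vanishingly small, which is what licenses the "actively misleading" interpretation over mere noise around chance.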
Answer fidelity stays relatively high while witness fidelity collapses
The shelf (answer accuracy holding up while witness fidelity degrades beneath it) is small at short context, then widens sharply by 512K. This is the core empirical bridge
from the theoretical mirage-gap story to a real frontier-model measurement.
Main figure
The mirage gap grows with context
Answer accuracy declines modestly across the normal condition, from 0.825 at 4K to
0.775 at 512K. Witness fidelity declines much faster, from 0.7867 to
0.5533, producing a much larger answer-vs-witness gap at the longest context.
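The quoted gap values follow directly from the two curves; a minimal sketch using the reported endpoint numbers:

```python
# Reported normal-condition scores at the two endpoint context lengths.
answer_acc = {"4K": 0.825, "512K": 0.775}
witness_fid = {"4K": 0.7867, "512K": 0.5533}

# Mirage gap = answer accuracy minus witness fidelity at each context.
mirage_gap = {ctx: answer_acc[ctx] - witness_fid[ctx] for ctx in answer_acc}
for ctx, gap in mirage_gap.items():
    print(f"{ctx}: {gap:+.4f}")  # → 4K: +0.0383, 512K: +0.2217
```

The same subtraction at the intermediate contexts produces the widening curve in the figure.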
Full run on 200 items. The strongest clean empirical pattern is the widening answer-vs-witness gap, not a monotone rise in corruption.
The normal-condition shelf is real: answer fidelity is relatively preserved while witness fidelity degrades materially.
The 512K point is a clear cliff: the mirage gap reaches +0.2217, the largest value in the run.
The control result is stronger than simple “drop to chance” language. The below-chance answers suggest the near-miss lines are often actively misleading once the true witness is removed.
Corruption in the normal condition is consistently present, but the strongest monotone signal in the full run is witness loss, not rising corruption.
Caveats
Limitations: a single model and synthetic items. Witness scoring should be treated as a lower bound, since exact-sentence
matching misses paraphrases; even at 4K it tops out around 79%. The public bundle keeps the branch
readable by publishing the chart, scores, script, and a scored sample rather than the full raw output dump.
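A lower bound of this kind usually comes from normalized exact matching. The sketch below is an illustrative scorer, not the run's actual script: it credits a cited sentence only when it reproduces a gold witness line verbatim after whitespace and case normalization, so correct paraphrases score zero.

```python
def normalize(sentence: str) -> str:
    """Collapse whitespace and case so trivial formatting differences pass."""
    return " ".join(sentence.split()).lower()

def witness_recovery(cited: list[str], gold: list[str]) -> float:
    """Fraction of gold witness lines reproduced exactly. A lower bound:
    a paraphrase of a witness line gets no credit."""
    gold_set = {normalize(g) for g in gold}
    hits = {normalize(c) for c in cited} & gold_set
    return len(hits) / len(gold_set) if gold_set else 0.0

gold = ["The vendor missed the Q3 deadline.", "Penalties apply after 30 days."]
cited = ["the vendor missed the  Q3 deadline.",  # exact after normalization
         "Fines kick in after a month."]         # paraphrase: not credited
print(witness_recovery(cited, gold))  # → 0.5
```

A scorer like this explains the roughly 79% ceiling at short context: some genuinely recovered evidence is restated rather than copied, and restatements are invisible to exact matching.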
Methods
Experiment design
The run used 200 synthetic long-document items with a binary decision question, exact witness lines, near-miss lines,
and distractor paragraphs. Each normal-condition request asked the model to choose an answer and copy the exact evidence sentences.
The causal control removed the true witness while leaving the near-miss structure in place.
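The item structure described above can be sketched as a small generator that interleaves witness lines and near-miss lines among distractor paragraphs; every name, question, and proportion here is illustrative, not the run's actual generator.

```python
import random

def build_item(witness: list[str], near_miss: list[str],
               distractors: list[str], seed: int = 0) -> dict:
    """Assemble one synthetic item: shuffle evidence and filler paragraphs,
    recording where the true witness lines landed. The causal control is
    the same call with witness=[] (near-miss structure left in place)."""
    rng = random.Random(seed)
    paragraphs = witness + near_miss + distractors
    rng.shuffle(paragraphs)
    return {
        "document": "\n\n".join(paragraphs),
        "witness_idx": [paragraphs.index(w) for w in witness],
        "question": ("Did the vendor breach the contract? Answer yes or no, "
                     "and copy the exact evidence sentences."),  # illustrative
    }

item = build_item(
    witness=["The vendor missed the Q3 deadline, breaching clause 4."],
    near_miss=["The vendor nearly missed the Q2 deadline but delivered."],
    distractors=["Quarterly revenue grew 3 percent."] * 3,
)
```

Scaling the distractor count is what moves an item from the 4K bucket to the 512K bucket while holding the evidence structure fixed.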
Model: grok-4-1-fast-non-reasoning
Contexts: normal at 4K, 16K, 64K, 256K, 512K; witness-removed at 4K, 64K, 256K
Scoring: answer correctness, witness recovery, corruption, and mixed-evidence rate
Execution: xAI batch API with direct replay for 13 failed rows to produce a complete local output set
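One way to read the four scoring dimensions is as per-response booleans. The sketch below uses assumed definitions (corruption = citing a near-miss line; mixed evidence = citing both a true witness and a near-miss line); the run's exact definitions are in its published script.

```python
def score_item(answer: str, cited: set[str], gold_answer: str,
               witness: set[str], near_miss: set[str]) -> dict:
    """Score one model response on the four dimensions listed above.
    The corruption and mixed-evidence definitions are this sketch's
    assumptions, not quoted from the run's script."""
    cites_witness = bool(cited & witness)
    cites_near_miss = bool(cited & near_miss)
    return {
        "answer_correct": answer == gold_answer,
        "witness_recovered": cites_witness,
        "corrupted": cites_near_miss,
        "mixed_evidence": cites_witness and cites_near_miss,
    }

# A response that answers correctly but cites both kinds of evidence:
s = score_item("yes", {"w1", "n1"}, "yes", witness={"w1"}, near_miss={"n1"})
# → all four flags True: correct, recovered, corrupted, and mixed
```

Averaging these flags per context length yields the curves in the main figure: the answer_correct mean falls slowly while the witness_recovered mean falls fast.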