A 200-item long-context experiment showing answer accuracy surviving longer than witness fidelity.
This run tested Grok on synthetic long-document decision items from 4K to 512K context.
The main empirical result is clean: in the normal condition, answer accuracy declines modestly while witness fidelity falls much faster,
widening the mirage gap from +0.038 at 4K to +0.222 at 512K.
The witness-removed causal control falls well below chance (0.5 on a binary question), from 0.315 at 4K to
0.275 at 256K. The conservative interpretation is that nearby near-miss evidence can actively
pull the model toward the wrong answer once the true witness is missing.
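The "well below chance" reading can be sanity-checked with an exact one-sided binomial tail test. The sketch below is illustrative only: it assumes a hypothetical 200 control items at 256K (per-condition item counts are not stated here) and asks how likely a 0.275 accuracy would be if the model were guessing at 0.5.

```python
from math import comb

# Hypothetical count: assume 200 control items at 256K (the writeup does
# not state per-condition counts). Accuracy 0.275 -> 55 correct.
n = 200
k = round(0.275 * n)  # 55

# Exact one-sided binomial tail: P(X <= k) when answers are coin flips.
p_value = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
print(f"P(<= {k}/{n} correct under chance) = {p_value:.2e}")
```

Under these assumed counts the tail probability is vanishingly small, which is what licenses the "actively misleading" interpretation over mere noise around chance.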
Answer fidelity stays relatively high while witness fidelity collapses
The shelf (answer accuracy holding up while witness fidelity degrades beneath it) is small at short context, then widens sharply by 512K. This is the core empirical bridge
from the theoretical mirage-gap story to a real frontier-model measurement.
Main figure
The mirage gap grows with context
Answer accuracy declines modestly across the normal condition, from 0.825 at 4K to
0.775 at 512K. Witness fidelity declines much faster, from 0.7867 to
0.5533, producing a much larger answer-vs-witness gap at the longest context.
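The quoted gap values follow directly from the two curves; a minimal sketch using the reported endpoint numbers:

```python
# Reported normal-condition scores at the two endpoint context lengths.
answer_acc = {"4K": 0.825, "512K": 0.775}
witness_fid = {"4K": 0.7867, "512K": 0.5533}

# Mirage gap = answer accuracy minus witness fidelity at each context.
mirage_gap = {ctx: answer_acc[ctx] - witness_fid[ctx] for ctx in answer_acc}
for ctx, gap in mirage_gap.items():
    print(f"{ctx}: {gap:+.4f}")  # → 4K: +0.0383, 512K: +0.2217
```

The same subtraction at the intermediate contexts produces the widening curve in the figure.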
Full run on 200 items. The strongest clean empirical pattern is the widening answer-vs-witness gap, not a monotone rise in corruption.
The normal-condition shelf is real: answer fidelity is relatively preserved while witness fidelity degrades materially.
The 512K point is a clear cliff: the mirage gap reaches +0.2217, the largest value in the run.
The control result is stronger than simple “drop to chance” language. The below-chance answers suggest the near-miss lines are often actively misleading once the true witness is removed.
Corruption in the normal condition is consistently present, but the strongest monotone signal in the full run is witness loss, not rising corruption.
Caveats
Limitations: a single model and synthetic items. Witness scoring should be treated as a lower bound, since exact-sentence
matching misses paraphrases; even at 4K it tops out around 79%. The public bundle keeps the branch
readable by publishing the chart, scores, script, and a scored sample rather than the full raw output dump.
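A lower bound of this kind usually comes from normalized exact matching. The sketch below is an illustrative scorer, not the run's actual script: it credits a cited sentence only when it reproduces a gold witness line verbatim after whitespace and case normalization, so correct paraphrases score zero.

```python
def normalize(sentence: str) -> str:
    """Collapse whitespace and case so trivial formatting differences pass."""
    return " ".join(sentence.split()).lower()

def witness_recovery(cited: list[str], gold: list[str]) -> float:
    """Fraction of gold witness lines reproduced exactly. A lower bound:
    a paraphrase of a witness line gets no credit."""
    gold_set = {normalize(g) for g in gold}
    hits = {normalize(c) for c in cited} & gold_set
    return len(hits) / len(gold_set) if gold_set else 0.0

gold = ["The vendor missed the Q3 deadline.", "Penalties apply after 30 days."]
cited = ["the vendor missed the  Q3 deadline.",  # exact after normalization
         "Fines kick in after a month."]         # paraphrase: not credited
print(witness_recovery(cited, gold))  # → 0.5
```

A scorer like this explains the roughly 79% ceiling at short context: some genuinely recovered evidence is restated rather than copied, and restatements are invisible to exact matching.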
Methods
Experiment design
The run used 200 synthetic long-document items with a binary decision question, exact witness lines, near-miss lines,
and distractor paragraphs. Each normal-condition request asked the model to choose an answer and copy the exact evidence sentences.
The causal control removed the true witness while leaving the near-miss structure in place.
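The item structure described above can be sketched as a small generator that interleaves witness lines and near-miss lines among distractor paragraphs; every name, question, and proportion here is illustrative, not the run's actual generator.

```python
import random

def build_item(witness: list[str], near_miss: list[str],
               distractors: list[str], seed: int = 0) -> dict:
    """Assemble one synthetic item: shuffle evidence and filler paragraphs,
    recording where the true witness lines landed. The causal control is
    the same call with witness=[] (near-miss structure left in place)."""
    rng = random.Random(seed)
    paragraphs = witness + near_miss + distractors
    rng.shuffle(paragraphs)
    return {
        "document": "\n\n".join(paragraphs),
        "witness_idx": [paragraphs.index(w) for w in witness],
        "question": ("Did the vendor breach the contract? Answer yes or no, "
                     "and copy the exact evidence sentences."),  # illustrative
    }

item = build_item(
    witness=["The vendor missed the Q3 deadline, breaching clause 4."],
    near_miss=["The vendor nearly missed the Q2 deadline but delivered."],
    distractors=["Quarterly revenue grew 3 percent."] * 3,
)
```

Scaling the distractor count is what moves an item from the 4K bucket to the 512K bucket while holding the evidence structure fixed.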
Model: grok-4-1-fast-non-reasoning
Contexts: normal at 4K, 16K, 64K, 256K, 512K; witness-removed at 4K, 64K, 256K
Scoring: answer correctness, witness recovery, corruption, and mixed-evidence rate
Execution: xAI batch API with direct replay for 13 failed rows to produce a complete local output set
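One way to read the four scoring dimensions is as per-response booleans. The sketch below uses assumed definitions (corruption = citing a near-miss line; mixed evidence = citing both a true witness and a near-miss line); the run's exact definitions are in its published script.

```python
def score_item(answer: str, cited: set[str], gold_answer: str,
               witness: set[str], near_miss: set[str]) -> dict:
    """Score one model response on the four dimensions listed above.
    The corruption and mixed-evidence definitions are this sketch's
    assumptions, not quoted from the run's script."""
    cites_witness = bool(cited & witness)
    cites_near_miss = bool(cited & near_miss)
    return {
        "answer_correct": answer == gold_answer,
        "witness_recovered": cites_witness,
        "corrupted": cites_near_miss,
        "mixed_evidence": cites_witness and cites_near_miss,
    }

# A response that answers correctly but cites both kinds of evidence:
s = score_item("yes", {"w1", "n1"}, "yes", witness={"w1"}, near_miss={"n1"})
# → all four flags True: correct, recovered, corrupted, and mixed
```

Averaging these flags per context length yields the curves in the main figure: the answer_correct mean falls slowly while the witness_recovered mean falls fast.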