MirageKit Research Program

Grok Mirage Shelf

A 200-item long-context experiment showing answer accuracy surviving longer than witness fidelity.

This run tested Grok on synthetic long-document decision items from 4K to 512K context. The main empirical result is clean: in the normal condition, answer accuracy declines modestly while witness fidelity falls much faster, widening the mirage gap from +0.038 at 4K to +0.222 at 512K.

The witness-removed causal control falls well below chance (0.5 on a binary decision), from 0.315 at 4K to 0.275 at 256K. The conservative interpretation is that nearby near-miss evidence can actively pull the model toward the wrong answer when the true witness is missing.

Latest Experiment

Answer fidelity stays relatively high while witness fidelity collapses

The shelf is small at short context, then widens sharply by 512K. This is the core empirical bridge from the theoretical mirage-gap story to a real frontier-model measurement.

01
Main figure

The mirage gap grows with context

Answer accuracy declines modestly across the normal condition, from 0.825 at 4K to 0.775 at 512K. Witness fidelity declines much faster, from 0.7867 to 0.5533, producing a much larger answer-vs-witness gap at the longest context.
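The mirage gap reported here is simply answer accuracy minus witness fidelity at each context length. A minimal sanity check of the two endpoint values:

```python
# Mirage gap = answer accuracy - witness fidelity, per context length.
# Values taken from the normal-condition scores in this write-up.
points = {
    "4K":   {"answer": 0.8250, "witness": 0.7867},
    "512K": {"answer": 0.7750, "witness": 0.5533},
}

for ctx, p in points.items():
    gap = round(p["answer"] - p["witness"], 4)
    print(f"{ctx}: gap {gap:+.4f}")
```

Running this reproduces the +0.0383 and +0.2217 endpoints quoted above.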

[Figure: Mirage shelf chart for Grok, showing answer accuracy, witness accuracy, and their gap across context length.]
Full run on 200 items. The strongest clean empirical pattern is the widening answer-vs-witness gap, not a monotone rise in corruption.
02
Key scores

Compact table

Normal condition

4K: answer 0.8250 · witness 0.7867 · gap +0.0383 · corruption 0.2567
16K: answer 0.8200 · witness 0.7767 · gap +0.0433 · corruption 0.2467
64K: answer 0.8100 · witness 0.7517 · gap +0.0583 · corruption 0.2717
256K: answer 0.8100 · witness 0.7150 · gap +0.0950 · corruption 0.2583
512K: answer 0.7750 · witness 0.5533 · gap +0.2217 · corruption 0.2200

Witness-removed control

4K: answer 0.3150 · witness 0.1100 · gap +0.2050 · corruption 0.6700
64K: answer 0.2850 · witness 0.1083 · gap +0.1767 · corruption 0.6850
256K: answer 0.2750 · witness 0.1050 · gap +0.1700 · corruption 0.6400
03
Interpretation

What the full run supports

  • The normal-condition shelf is real: answer fidelity is relatively preserved while witness fidelity degrades materially.
  • The 512K point is a clear cliff: the mirage gap reaches +0.2217, the largest value in the run.
  • The control result is stronger than simple “drop to chance” language. The below-chance answers suggest the near-miss lines are often actively misleading once the true witness is removed.
  • Corruption in the normal condition is consistently present, but the strongest monotone signal in the full run is witness loss, not rising corruption.

Caveats and limitations: this is a single model on synthetic items, and witness scoring should be treated as a lower bound, since exact-sentence matching misses paraphrases (roughly a 79% ceiling at 4K). The public bundle keeps the branch readable by publishing the chart, scores, script, and a scored sample rather than the full raw output dump.

Methods

Experiment design

The run used 200 synthetic long-document items with a binary decision question, exact witness lines, near-miss lines, and distractor paragraphs. Each normal-condition request asked the model to choose an answer and copy the exact evidence sentences. The causal control removed the true witness while leaving the near-miss structure in place.
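One item can be pictured as a record like the following. This is an illustrative sketch, not the actual item schema; all field names are assumptions:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MirageItem:
    """One synthetic long-document decision item (illustrative schema)."""
    question: str                 # binary decision question
    label: str                    # gold answer, e.g. "yes" or "no"
    witness_lines: List[str]      # exact sentences that settle the question
    near_miss_lines: List[str]    # similar sentences that point the wrong way
    distractors: List[str]        # filler paragraphs that pad the context

    def render(self, include_witness: bool = True) -> str:
        """Assemble the document text; include_witness=False mimics
        the witness-removed causal control."""
        parts = list(self.distractors) + list(self.near_miss_lines)
        if include_witness:
            parts += self.witness_lines
        return "\n\n".join(parts)
```

The point of the `include_witness` flag is that the control condition changes only the presence of the true witness, leaving the near-miss structure untouched.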

  • Model: grok-4-1-fast-non-reasoning
  • Contexts: normal at 4K, 16K, 64K, 256K, 512K; witness-removed at 4K, 64K, 256K
  • Scoring: answer correctness, witness recovery, corruption, and mixed-evidence rate
  • Execution: xAI batch API with direct replay for 13 failed rows to produce a complete local output set
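The scoring bullet above can be sketched per reply. The real scoring lives in the published script; this is a hedged approximation in which `score_item`, its arguments, and the dict keys are all illustrative names, and witness recovery is exact-sentence matching as described in the caveats:

```python
def score_item(reply_answer, quoted_sentences, item):
    """Score one model reply against one item (sketch, not the real script).

    - answer:     1 if the chosen answer matches the gold label
    - witness:    1 if at least one sentence was quoted and every quoted
                  sentence exactly matches a true witness line
    - corruption: 1 if any quoted sentence is a near-miss line
    """
    quoted = [s.strip() for s in quoted_sentences]
    answer = int(reply_answer == item["label"])
    witness = int(bool(quoted) and all(s in item["witness_lines"] for s in quoted))
    corruption = int(any(s in item["near_miss_lines"] for s in quoted))
    return {"answer": answer, "witness": witness, "corruption": corruption}
```

Because matching is exact, a correct answer backed by a paraphrased quote scores `answer=1, witness=0`, which is precisely the shape of the mirage shelf.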
Headline numbers

```json
{
  "normal_4k_gap": 0.0383,
  "normal_512k_gap": 0.2217,
  "normal_4k_answer_accuracy": 0.825,
  "normal_512k_answer_accuracy": 0.775,
  "normal_4k_witness_accuracy": 0.7867,
  "normal_512k_witness_accuracy": 0.5533,
  "removed_4k_answer_accuracy": 0.315,
  "removed_256k_answer_accuracy": 0.275
}
```