MirageKit
The papers, theory, witness, and evaluation framing for validity mirage behavior.
Context compression can look valid while the governing task has drifted.
When LLM agents compress long conversations, they can silently lose track of which task they're solving while still producing confident answers. We call this a validity mirage. MirageKit is the research program; dreams is the public showcase for papers and artifacts; the evaluation MCP lives in the separate tropical-mcp repository.
Read this like a research packet. It contains the flagship paper, a deterministic replay witness (n=3 per policy and retention fraction), mirrored validation logs, a certificate artifact, and two verification paths: a three-call smoke test and a fuller reviewer workflow.
Start with the flagship paper, inspect the replayed witness, then run the local verify path to reproduce the divergence yourself.
MirageKit: The papers, theory, witness, and evaluation framing for validity mirage behavior.
dreams: This website, working-paper bundle, and committed public artifacts live here.
tropical-mcp: The source-available MCP server you register in Codex or Claude-style clients to evaluate guarded compaction directly.
First public release (working-paper stage, 2026) · dreams = public evidence surface · tropical-mcp = source-available evaluation MCP · DOI-backed archive = dreams v0.1.1 · mirrored implementation release = tropical-mcp v0.2.1.
Use this repo for the working papers, replay artifacts, and committed evidence that support the current claims.
The runnable MCP server is published in its own repository so install docs, changelog, tests, and examples stay close to the code.
Move from the replay cards to the witness, certificate, and source papers. Every number on the page comes from committed artifacts.
Use the three-call smoke test for a fast implementation check, or extend to the fuller reviewer workflow when you want diagnostics, anchors, and telemetry.
tropical-mcp is the evaluation implementation. Use dreams for the paper set, replay witness, public certificate, and the broader research narrative.
tropical-mcp
Clone the evaluation repo, then register the MCP in Codex so the tool calls stay explicit and auditable.
git clone https://github.com/jack-chaudier/tropical-mcp.git ~/tropical-mcp
codex mcp add tropical-mcp \
  --env TROPICAL_MCP_CLIENT=codex -- \
  uv --directory ~/tropical-mcp run tropical-mcp
codex mcp list
Expected signal: codex mcp list should show tropical-mcp as an available server.
Use a small explicit payload so the pivot and predecessor structure remain visible at a glance. This verifies the packaged MCP surface; it is not the full research workflow.
# Sample payload: one pivot, two predecessors, one noise message.
messages = [
    {
        "id": "goal",
        "role": "user",
        "content": "Build a long-running coding agent workflow for Codex.",
        "role_hint": "pivot",
    },
    {
        "id": "constraint_stdio",
        "role": "user",
        "content": "Use stdio transport and never emit JSON-RPC data to stdout logs.",
        "role_hint": "predecessor",
    },
    {
        "id": "constraint_clients",
        "role": "user",
        "content": "Support Codex and Claude-style clients through explicit MCP tool calls.",
        "role_hint": "predecessor",
    },
    {
        "id": "status",
        "role": "assistant",
        "content": "I am wiring the verification flow and docs.",
        "role_hint": "noise",
    },
]

# Three-call smoke test: runtime_info, compact_auto, certificate.
runtime_info()

compact_auto(
    messages=messages,
    token_budget=45,
    k_target=2,
    mode="adaptive",
)

certificate(
    messages=messages,
    token_budget=45,
    k=2,
)
Expected signal: compact_auto(...) should prefer the guarded policy on this witness payload, and certificate(...) should preserve a portable audit of the same comparison.
For a fuller reviewer pass, continue with diagnose(...), context_anchor(...), and telemetry_summary(...) so feasibility, protected chunks, and retention behavior remain explicit in the audit trail.
runtime_info() should report the resolved client and telemetry path. compact_auto(...) should choose l2_guarded on the sample. certificate(...) should show the kept and dropped IDs for the recency and guarded policies side by side.
After the smoke test, the fuller review sequence is runtime_info(), diagnose(...), context_anchor(...), compact_auto(...), certificate(...), and telemetry_summary(...). That path keeps witness feasibility, protected predecessors, and retained context visible instead of inferring them after the fact.
Implementation docs and the Codex example bundle live in the tropical-mcp repository. The rendered evidence surface for this site lives in the evidence dossier, with direct links there to the replay witness, validation summary, and certificate artifact when you want to inspect the underlying files.
tropical-mcp is currently source-available for academic and internal evaluation. For redistribution, derivative, or commercial rights, see the repository license or contact the author.
Naive recency: keep the most recent messages and drop the oldest.
Guarded: keep the messages that the current task depends on, even if they are old.
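The difference between the two policies can be sketched in a few lines of Python. This is a minimal illustration, not the tropical-mcp implementation: the tokens field, the per-message counts, the budget of 30, and the helper names compact_recency and compact_guarded are all assumptions made for the example.

```python
# Hypothetical payload: ids and role hints mirror the smoke-test sample,
# but the token counts are made up for illustration.
MESSAGES = [
    {"id": "goal", "role_hint": "pivot", "tokens": 8},
    {"id": "constraint_stdio", "role_hint": "predecessor", "tokens": 11},
    {"id": "constraint_clients", "role_hint": "predecessor", "tokens": 10},
    {"id": "status", "role_hint": "noise", "tokens": 8},
]


def compact_recency(messages, token_budget):
    """Keep the newest contiguous suffix that fits in the budget."""
    kept, used = [], 0
    for msg in reversed(messages):
        if used + msg["tokens"] > token_budget:
            break  # this message and everything older is dropped
        kept.append(msg["id"])
        used += msg["tokens"]
    return kept[::-1]


def compact_guarded(messages, token_budget):
    """Reserve the pivot and its predecessors first, then fill by recency."""
    protected = [m for m in messages if m["role_hint"] in ("pivot", "predecessor")]
    used = sum(m["tokens"] for m in protected)
    if used > token_budget:
        raise ValueError("protected messages alone exceed the token budget")
    keep_ids = {m["id"] for m in protected}
    for msg in reversed(messages):
        if msg["id"] not in keep_ids and used + msg["tokens"] <= token_budget:
            keep_ids.add(msg["id"])
            used += msg["tokens"]
    # Preserve the original conversation order in the output.
    return [m["id"] for m in messages if m["id"] in keep_ids]


recency_kept = compact_recency(MESSAGES, token_budget=30)
guarded_kept = compact_guarded(MESSAGES, token_budget=30)
print("recency:", recency_kept)
print("guarded:", guarded_kept)
```

With these made-up counts, the recency policy keeps the three newest messages but drops goal, which is the pivot. The guarded policy keeps the pivot and both constraints and drops the status update instead.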
d_pre >= k
Guarded compaction is certified safe only when the pivot retains its required predecessor depth.
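As a sketch of this condition, assume d_pre counts the retained messages tagged predecessor, and that the check only applies when the pivot itself survives. That reading, the role_hint tagging, and the function names below are assumptions for illustration; the papers give the authoritative definition.

```python
def predecessor_depth(kept):
    """Count retained messages tagged as predecessors (assumed meaning of d_pre)."""
    return sum(1 for m in kept if m["role_hint"] == "predecessor")


def guard_holds(kept, k):
    """Return True when the pivot survives and d_pre >= k."""
    has_pivot = any(m["role_hint"] == "pivot" for m in kept)
    return has_pivot and predecessor_depth(kept) >= k


# Hypothetical retained contexts.
full = [
    {"id": "goal", "role_hint": "pivot"},
    {"id": "constraint_stdio", "role_hint": "predecessor"},
    {"id": "constraint_clients", "role_hint": "predecessor"},
]
trimmed = full[:2]    # pivot plus a single predecessor
no_pivot = full[1:]   # both predecessors, pivot dropped

print(guard_holds(full, k=2))      # pivot kept, d_pre = 2
print(guard_holds(trimmed, k=2))   # pivot kept, d_pre = 1
print(guard_holds(no_pivot, k=2))  # pivot missing
```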
W[k] = -infinity -> infeasible
If the k-slot frontier is negative infinity, no valid completion exists for that retained context.
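One way to see why negative infinity signals infeasibility is a max-plus (tropical) recurrence in which W[j] holds the best total weight achievable with exactly j predecessor slots filled. The recurrence, the per-predecessor weights, and the name k_slot_frontier are assumptions chosen for this sketch; they are not taken from the tropical-mcp source.

```python
import math

NEG_INF = -math.inf


def k_slot_frontier(weights, k):
    """Return W where W[j] is the best total weight using exactly j slots.

    Max-plus recurrence: W[j] = max(W[j], W[j - 1] + w). A slot count that
    cannot be filled keeps its initial value of negative infinity.
    """
    W = [0.0] + [NEG_INF] * k
    for w in weights:
        # Walk j downward so each weight fills at most one slot.
        for j in range(k, 0, -1):
            W[j] = max(W[j], W[j - 1] + w)
    return W


def is_feasible(weights, k):
    """A retained context is feasible only if the k-th slot is reachable."""
    return k_slot_frontier(weights, k)[k] != NEG_INF


two_preds = k_slot_frontier([3.0, 1.5], k=2)  # two predecessors retained
one_pred = k_slot_frontier([3.0], k=2)        # only one predecessor retained
print(two_preds, is_feasible([3.0, 1.5], k=2))
print(one_pred, is_feasible([3.0], k=2))
```

In this toy setting, a context with a single retained predecessor can never fill two slots, so W[2] stays at negative infinity and the context is reported as infeasible.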
raw = max(primary_full, decoy_full)
Answerability alone can stay high even when pivot identity has silently changed.
delta = raw - pivot_preservation
A large positive gap indicates a validity mirage regime rather than true semantic stability.
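The two formulas for raw and delta combine as in the short sketch below. The function name and the example scores are hypothetical and are not results from the papers; the scores are assumed to lie in [0, 1].

```python
def mirage_gap(primary_full, decoy_full, pivot_preservation):
    """Return (raw, delta) following the raw and delta formulas."""
    raw = max(primary_full, decoy_full)
    delta = raw - pivot_preservation
    return raw, delta


# Hypothetical scores: the decoy task is answered well, the pivot is mostly lost.
raw, delta = mirage_gap(primary_full=0.25, decoy_full=0.75, pivot_preservation=0.25)
print(raw, delta)
```

Here raw stays high because the decoy task is still answerable, while pivot preservation is low, so delta is large and positive.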
Current evidence combines a small committed replay witness with broader paper-level studies. The strongest claim in this demo is structural: naive recency can preserve answerability while losing pivot integrity.
Replay witness: results/replay/, with n=3 variants per policy and retention fraction.