MirageKit Research Program

MirageKit

Context compression can look valid while the governing task has drifted.

When LLM agents compress long conversations, they can silently lose track of which task they're solving while still producing confident answers. We call this a validity mirage. MirageKit is the research program; dreams is the public showcase for papers and artifacts; the evaluation MCP lives in the separate tropical-mcp repository.

Read this like a research packet: a flagship paper, a deterministic replay witness (n=3 per policy and retention fraction), mirrored validation logs, a certificate artifact, and two verification paths (a three-call smoke test and a fuller reviewer workflow).

Start with the flagship paper, inspect the replayed witness, then run the local verify path to reproduce the divergence yourself.

Research Program

MirageKit

The papers, theory, witness, and evaluation framing for validity mirage behavior.

Showcase Repo

dreams

This website, working-paper bundle, and committed public artifacts live here.

Evaluation Implementation

tropical-mcp

The source-available MCP server you register in Codex or Claude-style clients to evaluate guarded compaction directly.

First public release (working-paper stage, 2026) · dreams = public evidence surface · tropical-mcp = source-available evaluation MCP · DOI-backed archive = dreams v0.1.1 · mirrored implementation release = tropical-mcp v0.2.1.

Start Here

Read The Research

Papers + Results

Use this repo for the working papers, replay artifacts, and committed evidence that support the current claims.

View The Implementation

Evaluate tropical-mcp

The runnable MCP server is published in its own repository so install docs, changelog, tests, and examples stay close to the code.

Understand The Demo

Interactive Proof Path

Move from the replay cards to the witness, certificate, and source papers. Every number on the page comes from committed artifacts.

Verify The Tool

Smoke Test + Research Workflow

Use the three-call smoke test for a fast implementation check, or extend to the fuller reviewer workflow when you want diagnostics, anchors, and telemetry.

Evaluate + Verify

tropical-mcp is the evaluation implementation. Use dreams for the paper set, replay witness, public certificate, and the broader research narrative.

1. Register the MCP
2. Run the minimal smoke test
3. Expand to the research workflow
01
Codex Registration

Register tropical-mcp

Clone the evaluation repo, then register the MCP in Codex so the tool calls stay explicit and auditable.

codex shell
git clone https://github.com/jack-chaudier/tropical-mcp.git ~/tropical-mcp
codex mcp add tropical-mcp \
  --env TROPICAL_MCP_CLIENT=codex -- \
  uv --directory ~/tropical-mcp run tropical-mcp
codex mcp list

Expected signal: codex mcp list should show tropical-mcp as an available server.

02
Three-Call Smoke Test

Minimal Verification

Use a small explicit payload so the pivot and predecessor structure remain visible at a glance. This verifies the packaged MCP surface; it is not the full research workflow.

payload
messages = [
  {
    "id": "goal",
    "role": "user",
    "content": "Build a long-running coding agent workflow for Codex.",
    "role_hint": "pivot",
  },
  {
    "id": "constraint_stdio",
    "role": "user",
    "content": "Use stdio transport and never emit JSON-RPC data to stdout logs.",
    "role_hint": "predecessor",
  },
  {
    "id": "constraint_clients",
    "role": "user",
    "content": "Support Codex and Claude-style clients through explicit MCP tool calls.",
    "role_hint": "predecessor",
  },
  {
    "id": "status",
    "role": "assistant",
    "content": "I am wiring the verification flow and docs.",
    "role_hint": "noise",
  },
]
verify
runtime_info()
compact_auto(
  messages=messages,
  token_budget=45,
  k_target=2,
  mode="adaptive",
)
certificate(
  messages=messages,
  token_budget=45,
  k=2,
)

Expected signal: compact_auto(...) should prefer the guarded policy on this witness payload, and certificate(...) should preserve a portable audit of the same comparison.

For a fuller reviewer pass, continue with diagnose(...), context_anchor(...), and telemetry_summary(...) so feasibility, protected chunks, and retention behavior remain explicit in the audit trail.

Expected signals

runtime_info() should report the resolved client and telemetry path. compact_auto(...) should choose l2_guarded on the sample. certificate(...) should show the kept and dropped message IDs for the recency and guarded policies side by side.

Recommended research workflow

After the smoke test, the fuller review sequence is runtime_info(), diagnose(...), context_anchor(...), compact_auto(...), certificate(...), and telemetry_summary(...). That path keeps witness feasibility, protected predecessors, and retained context visible instead of inferring them after the fact.
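The sequence above can be sketched as a driver loop. Here `call(tool, **kwargs)` is a hypothetical stand-in for however your MCP client dispatches tool calls; it is not a tropical-mcp API, and the arguments passed to diagnose and context_anchor are assumptions rather than documented signatures.

```python
# Hypothetical reviewer-workflow driver. `call` stands in for your MCP client's
# dispatch mechanism; it is NOT a tropical-mcp API, and the per-tool arguments
# beyond compact_auto/certificate are assumptions.
def review_sequence(call, messages, token_budget, k):
    audit = []
    audit.append(("runtime_info", call("runtime_info")))
    audit.append(("diagnose", call("diagnose", messages=messages)))
    audit.append(("context_anchor", call("context_anchor", messages=messages)))
    audit.append(("compact_auto", call(
        "compact_auto", messages=messages,
        token_budget=token_budget, k_target=k, mode="adaptive")))
    audit.append(("certificate", call(
        "certificate", messages=messages, token_budget=token_budget, k=k)))
    audit.append(("telemetry_summary", call("telemetry_summary")))
    return audit  # ordered (tool, result) pairs for the audit trail
```

Recording results as an ordered list mirrors the audit-trail goal: each tool's output is kept in the order it was produced, so feasibility and retention decisions stay inspectable after the fact.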

Source of truth

Implementation docs and the Codex example bundle live in the tropical-mcp repository. The rendered evidence surface for this site lives in the evidence dossier, which links directly to the replay witness, validation summary, and certificate artifact when you want to inspect the underlying files.

License boundary

tropical-mcp is currently source-available for academic and internal evaluation. For redistribution, derivative, or commercial rights, see the repository license or contact the author.

What happens when you compress a conversation?
Retention Budget

[Interactive slider: replay checkpoints at 40%, 50%, 65%, 80%, and 100% retention. At the 100% checkpoint, full context is retained, both policies remain aligned, and the regime reads "Aligned".]

Observed replay checkpoints from committed artifacts. No synthetic interpolation.

Naive Recency

Keep the most recent messages, drop the oldest.

vs

Tropical L2 · Guarded

Keep messages that the current task depends on, even if old.
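The two policies above can be contrasted in a few lines. This is an illustrative sketch only: the message schema (id, tokens, deps) is made up for the example and is not the tropical-mcp format, and the guarded policy is simplified to a single dependency hop.

```python
# Minimal sketch of the two compaction policies compared on this page.
# The message schema (id, tokens, deps) is illustrative, not tropical-mcp's.

def naive_recency(messages, budget):
    """Keep the newest messages that fit the budget; drop the oldest."""
    kept, used = [], 0
    for msg in reversed(messages):
        if used + msg["tokens"] <= budget:
            kept.append(msg["id"])
            used += msg["tokens"]
    return set(kept)

def guarded(messages, budget, pivot):
    """Reserve the pivot and its dependencies first, then fill by recency."""
    by_id = {m["id"]: m for m in messages}
    required = {pivot} | set(by_id[pivot]["deps"])
    kept, used = set(), 0
    for msg in messages:  # protected chunks first (assumed to fit the budget)
        if msg["id"] in required:
            kept.add(msg["id"])
            used += msg["tokens"]
    for msg in reversed(messages):  # then recency fills whatever remains
        if msg["id"] not in kept and used + msg["tokens"] <= budget:
            kept.add(msg["id"])
            used += msg["tokens"]
    return kept

messages = [
    {"id": "goal", "tokens": 10, "deps": ["constraint"]},
    {"id": "constraint", "tokens": 10, "deps": []},
    {"id": "noise_1", "tokens": 10, "deps": []},
    {"id": "noise_2", "tokens": 10, "deps": []},
]
# At a tight budget, recency keeps only the newest noise and drops the old
# constraint; guarded protects the goal and the constraint it depends on.
print(naive_recency(messages, 20))
print(guarded(messages, 20, "goal"))
```

On this toy payload the recency policy retains only {noise_1, noise_2}, while the guarded policy retains {goal, constraint}: both fit the budget, but only one preserves what the task depends on.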

Math Snapshot

Core Contract

d_pre >= k

Guarded compaction is certified safe only when the pivot retains its required predecessor depth.

Frontier Feasibility

W[k] = -infinity -> infeasible

If the k-slot frontier is negative infinity, no valid completion exists for that retained context.
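Taken together, the core contract and the frontier condition amount to two simple gates. The sketch below is illustrative, not the tropical-mcp internals; `W` here is just a list of frontier values indexed by retained-slot count, with made-up numbers.

```python
# Illustrative sketch of the two gates above; not the tropical-mcp internals.
NEG_INF = float("-inf")

def contract_holds(d_pre, k):
    """Core contract: certify only when the pivot keeps predecessor depth d_pre >= k."""
    return d_pre >= k

def frontier_feasible(W, k):
    """Frontier feasibility: W[k] == -infinity means no valid completion exists."""
    return W[k] != NEG_INF

# Example frontier values per retained-slot count (made-up numbers):
W = [0.0, -1.2, NEG_INF]
print(contract_holds(d_pre=2, k=2))  # True: depth requirement met
print(frontier_feasible(W, k=1))     # True
print(frontier_feasible(W, k=2))     # False: infeasible at k = 2
```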

Raw Validity

raw = max(primary_full, decoy_full)

Answerability alone can stay high even when pivot identity has silently changed.

Mirage Gap

delta = raw - pivot_preservation

Large positive gap indicates a validity mirage regime rather than true semantic stability.
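The raw-validity and mirage-gap formulas above compose into a one-line check. The numeric scores below are illustrative only, not values taken from the committed artifacts.

```python
# Illustrative mirage-gap check; the scores are made-up, not artifact values.

def raw_validity(primary_full, decoy_full):
    """Answerability can stay high if either the primary or a decoy task scores well."""
    return max(primary_full, decoy_full)

def mirage_gap(primary_full, decoy_full, pivot_preservation):
    """delta = raw - pivot_preservation; a large positive delta flags a mirage regime."""
    return raw_validity(primary_full, decoy_full) - pivot_preservation

# A compaction that still answers confidently (decoy score 0.9) while pivot
# integrity has collapsed (0.2) shows a large gap:
delta = mirage_gap(primary_full=0.4, decoy_full=0.9, pivot_preservation=0.2)
print(round(delta, 2))  # 0.7
```

The point of the example is the failure shape: raw validity is carried entirely by the decoy score, so answerability looks fine even though the pivot has effectively been lost.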

Evidence Boundaries

Current evidence combines a small committed replay witness with broader paper-level studies. The strongest claim in this demo is structural: naive recency can preserve answerability while losing pivot integrity.