Architecture and Modes

Source scope as of July 1, 2026

Component names, the pipeline diagram, and the lifecycle stages below are quoted from the Headroom README. Internal behavior may change between releases — verify against the architecture docs before depending on specifics.

1. The pipeline

Headroom runs locally and processes context in a fixed order before it reaches the provider:

 Your agent / app
   (Claude Code, Cursor, Codex, LangChain, your own code…)
        │   prompts · tool outputs · logs · RAG results · files
        ▼
    ┌────────────────────────────────────────────────────┐
    │  Headroom   (runs locally — your data stays here)   │
    │  ────────────────────────────────────────────────  │
    │  CacheAligner  →  ContentRouter  →  CCR             │
    │                    ├─ SmartCrusher   (JSON)         │
    │                    ├─ CodeCompressor (AST)          │
    │                    └─ Kompress-v2-base (text, HF)   │
    │                                                     │
    │  Cross-agent memory  ·  headroom learn  ·  MCP      │
    └────────────────────────────────────────────────────┘
        │   compressed prompt  +  retrieval tool
        ▼
 LLM provider  (Anthropic · OpenAI · Bedrock · …)

The moving parts, per the README:

ContentRouter — detects the content type and selects the right compressor.
SmartCrusher — universal JSON: arrays of dicts, nested objects, mixed types.
CodeCompressor — AST-aware for Python, JS/TS, Go, Rust, Java, C/C++, and Perl.
Kompress-v2-base — Headroom's HuggingFace model, trained on agentic traces, for prose.
CacheAligner — stabilizes prompt prefixes so provider KV caches actually hit.
CCR — reversible compression: originals are stored locally and pulled back on demand.

2. Reversible compression (CCR)

CCR is what makes aggressive compression safe. Instead of discarding detail, Headroom keeps the originals in a local cache and exposes a retrieval tool. When the model decides it needs the full text, it calls headroom_retrieve and gets the original back. Retrieval is bounded by a configurable TTL, so the cache does not grow forever.

The practical implication: you can compress hard without permanently losing information the model might later need.

3. Content-aware compressor modules

Module	Handles	Notes
SmartCrusher	JSON	Arrays of dicts, nested objects, mixed types
CodeCompressor	Source code	AST-aware across many languages
Kompress-v2-base	Prose / text	ML model trained on agentic traces
Image compression	Images	Project reports 40–90% reduction via a trained ML router

4. The four modes

Library

Call compress(messages) inline, in Python or TypeScript, anywhere in your own code. This gives you the most control and is the right choice when you own the request path. See Library and SDK integration.

Proxy

headroom proxy --port 8787 starts a local, OpenAI-compatible proxy. Point any client at it and get compression with zero code changes, in any language. This is the fastest way to measure savings. See Proxy and agent wrapping.

Agent wrap

headroom wrap <tool> wires a supported coding agent through Headroom in one command (and headroom unwrap <tool> undoes it). Supported tools per the README include claude, codex, copilot, cursor, aider, opencode, cline, continue, goose, openhands, openclaw, and vibe. Coverage and per-agent behavior are documented in the agent matrix.

MCP server

Headroom exposes MCP tools — headroom_compress, headroom_retrieve, and headroom_stats — for any MCP client. Install with headroom mcp install. This lets an MCP-native agent compress and retrieve context as explicit tool calls.

5. Cross-agent memory

Beyond per-request compression, Headroom keeps a shared memory store across agents (the README names Claude, Codex, and Gemini) with agent provenance and automatic deduplication. SharedContext().put / .get passes compressed context across multi-agent workflows. This is the feature that turns Headroom from a per-call optimizer into a cross-session context layer.

6. Request lifecycle

Headroom exposes one stable request lifecycle across compress(), the SDK, and the proxy:

Setup → Pre-Start → Post-Start → Input Received → Input Cached
→ Input Routed → Input Compressed → Input Remembered
→ Pre-Send → Post-Send → Response Received

Extensions can observe or customize any stage via on_pipeline_event(...), and provider- or tool-specific behavior lives under headroom/providers/ so the core stays focused on sequencing and policy. If you need to hook custom logic into compression, this lifecycle is the seam to target.

1. The pipeline​

2. Reversible compression (CCR)​

3. Content-aware compressor modules​

4. The four modes​

Library​

Proxy​

Agent wrap​

MCP server​

5. Cross-agent memory​

6. Request lifecycle​

7. References​