Architecture and Modes
Component names, the pipeline diagram, and the lifecycle stages below are quoted from the Headroom README. Internal behavior may change between releases β verify against the architecture docs before depending on specifics.
1. The pipelineβ
Headroom runs locally and processes context in a fixed order before it reaches the provider:
Your agent / app
(Claude Code, Cursor, Codex, LangChain, your own codeβ¦)
β prompts Β· tool outputs Β· logs Β· RAG results Β· files
βΌ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Headroom (runs locally β your data stays here) β
β ββββββββββββββββββββββββββββββββββββββββββββββββ β
β CacheAligner β ContentRouter β CCR β
β ββ SmartCrusher (JSON) β
β ββ CodeCompressor (AST) β
β ββ Kompress-v2-base (text, HF) β
β β
β Cross-agent memory Β· headroom learn Β· MCP β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β compressed prompt + retrieval tool
βΌ
LLM provider (Anthropic Β· OpenAI Β· Bedrock Β· β¦)
The moving parts, per the README:
- ContentRouter β detects the content type and selects the right compressor.
- SmartCrusher β universal JSON: arrays of dicts, nested objects, mixed types.
- CodeCompressor β AST-aware for Python, JS/TS, Go, Rust, Java, C/C++, and Perl.
- Kompress-v2-base β Headroom's HuggingFace model, trained on agentic traces, for prose.
- CacheAligner β stabilizes prompt prefixes so provider KV caches actually hit.
- CCR β reversible compression: originals are stored locally and pulled back on demand.
2. Reversible compression (CCR)β
CCR is what makes aggressive compression safe. Instead of discarding detail, Headroom keeps the originals in a local cache and exposes a retrieval tool. When the model decides it needs the full text, it calls headroom_retrieve and gets the original back. Retrieval is bounded by a configurable TTL, so the cache does not grow forever.
The practical implication: you can compress hard without permanently losing information the model might later need.
3. Content-aware compressor modulesβ
| Module | Handles | Notes |
|---|---|---|
| SmartCrusher | JSON | Arrays of dicts, nested objects, mixed types |
| CodeCompressor | Source code | AST-aware across many languages |
| Kompress-v2-base | Prose / text | ML model trained on agentic traces |
| Image compression | Images | Project reports 40β90% reduction via a trained ML router |
4. The four modesβ
Libraryβ
Call compress(messages) inline, in Python or TypeScript, anywhere in your own code. This gives you the most control and is the right choice when you own the request path. See Library and SDK integration.
Proxyβ
headroom proxy --port 8787 starts a local, OpenAI-compatible proxy. Point any client at it and get compression with zero code changes, in any language. This is the fastest way to measure savings. See Proxy and agent wrapping.
Agent wrapβ
headroom wrap <tool> wires a supported coding agent through Headroom in one command (and headroom unwrap <tool> undoes it). Supported tools per the README include claude, codex, copilot, cursor, aider, opencode, cline, continue, goose, openhands, openclaw, and vibe. Coverage and per-agent behavior are documented in the agent matrix.
MCP serverβ
Headroom exposes MCP tools β headroom_compress, headroom_retrieve, and headroom_stats β for any MCP client. Install with headroom mcp install. This lets an MCP-native agent compress and retrieve context as explicit tool calls.
5. Cross-agent memoryβ
Beyond per-request compression, Headroom keeps a shared memory store across agents (the README names Claude, Codex, and Gemini) with agent provenance and automatic deduplication. SharedContext().put / .get passes compressed context across multi-agent workflows. This is the feature that turns Headroom from a per-call optimizer into a cross-session context layer.
6. Request lifecycleβ
Headroom exposes one stable request lifecycle across compress(), the SDK, and the proxy:
Setup β Pre-Start β Post-Start β Input Received β Input Cached
β Input Routed β Input Compressed β Input Remembered
β Pre-Send β Post-Send β Response Received
Extensions can observe or customize any stage via on_pipeline_event(...), and provider- or tool-specific behavior lives under headroom/providers/ so the core stays focused on sequencing and policy. If you need to hook custom logic into compression, this lifecycle is the seam to target.