Zum Hauptinhalt springen

Proxy and Agent Wrapping

Source scope as of July 1, 2026

The agent matrix, flags, and environment variables below are quoted from the Headroom README. Agent coverage changes often — check the proxy docs for the current list.

These are the two zero-code-change ways to run Headroom. Both are the recommended starting point for evaluating savings.

1. Proxy mode​

headroom proxy --port 8787

This starts a local, OpenAI-compatible proxy. Point any OpenAI-compatible client at it and every request is compressed before it goes upstream — no application changes, any language. It is also the surface that reports live savings via headroom dashboard.

2. Agent wrapping​

headroom wrap <tool> wires a supported coding agent through Headroom in one command; headroom unwrap <tool> reverses durable wrapping.

headroom wrap claude
headroom unwrap claude

Agent compatibility matrix​

Agentheadroom wrapNotes
Claude Code✅flags: --memory · --code-graph · --1m · --tool-search
Codex✅shares memory with Claude
CursorManual setupstarts proxy and prints base URLs for Cursor settings
Aider✅starts proxy + launches
Copilot CLI✅starts proxy + launches
OpenClaw✅installs as a ContextEngine plugin
OpenCode✅injects config · starts proxy + launches
Cline✅starts proxy + injects config
Continue✅starts proxy + injects config
Goose✅starts proxy + launches
OpenHands✅starts proxy + launches
Mistral Vibe✅starts proxy + launches
Cortex CodeLibrary only60–65% savings (library mode; no wrap)

Any OpenAI-compatible client works via headroom proxy. MCP-native clients use headroom mcp install. Durable wrapping can be undone for claude, copilot, codex, opencode, and openclaw.

3. Output-token reduction​

Everything above shrinks the prompt you send. Headroom can also trim what the model writes back — preambles, restated code, and deep "thinking" on routine steps — which matters because output tokens are billed at a premium on large models.

It is off by default. Turn it on at the proxy:

export HEADROOM_OUTPUT_SHAPER=1
headroom proxy --port 8787

Two mechanisms, per the README:

  • Verbosity steering — appends a short "be terse, don't restate context" note to the end of the system prompt, so your prompt cache still hits.
  • Effort routing — when a turn is just the model resuming after a tool result, it dials thinking effort down; new questions and errors keep full effort.
Set the switches before you wrap

These are read live per request. Set HEADROOM_OUTPUT_SHAPER before headroom wrap, which hot-syncs current settings to the running proxy via a loopback admin call (no restart). On a shared proxy the overrides are global — the last explicit setting wins.

Learn your preferred terseness​

headroom learn --verbosity            # preview what it found (dry run)
headroom learn --verbosity --apply # save it; the proxy uses it from now on

Measure the output savings honestly​

Output savings are counterfactual — Headroom never sees what the model would have written — so it reports an estimate with a confidence range rather than a made-up number:

headroom output-savings
# Reduction: 31.7% (95% CI 27.7% … 35.7%) [estimated]

For a measured number, hold out a control group: export HEADROOM_OUTPUT_HOLDOUT=0.1 leaves 10% of conversations unshaped, and the dashboard labels the result measured or estimated.

4. GitHub Copilot subscription mode (brief)​

Headroom can route GitHub Copilot CLI subscription traffic through the local proxy:

headroom copilot-auth login
headroom wrap copilot --subscription -- --model gpt-4o

For GitHub Enterprise Server or custom-domain deployments, set GITHUB_COPILOT_ENTERPRISE_DOMAIN before launching. The README notes that non-macOS keychain reuse paths are implemented or planned but not fully vetted, so for Docker/CI prefer passing an explicit GITHUB_COPILOT_TOKEN. See the README for the full flow.

5. References​