Cut Claude Code's input tokens by rendering bulky context as images — the same system prompt, tool docs, and history, in a fraction of the tokens.
An image's token cost is fixed by its pixel dimensions, not by how much text is inside it. Dense content (code, JSON, tool output) packs ~3.1 chars per image-token vs ~1 char per text-token on real Claude Code traffic. pxpipe is a local proxy that exploits that gap: it rewrites the bulky parts of your request (system prompt, tool docs, older history) into compact PNGs before the request leaves your machine.
Savings are workload-dependent — pxpipe wins on token-dense content and
leaves sparse/small requests untouched — so these are measured snapshots, not
constants. The primary, durable result is input-token reduction: dense
system prompts, tool docs, and history go in as compact images instead of text
(the example above is ≈25k text tokens rendered as ≈2.7k image tokens), every
request measured against its own count_tokens counterfactual. Dollars are
downstream of that — at current Fable list prices the token cut lands as a
~59–70% lower end-to-end bill (~72–74% on compressed requests; full pricing
math in the FAQ). But list prices can change tomorrow and the token count
won't, so tokens — not dollars — are the number to watch. Reproduce both from
~/.pxpipe/events.jsonl.
This is what the model sees instead of text:
~48k characters of system prompt + tool docs (this repo's own README,
FINDINGS, and source), ≈25k tokens as text, ≈2.7k image tokens as this page.
Produced by the real transformRequest pipeline: whitespace-minified, reflowed
into full rows with ↵ marking original newlines, OCR instruction banner
co-rendered on top. The model reads renders like this at 100/100 on a clean
eval (see benchmarks).
Fable 5 demo (the default, 100/100 reader):
Fable-AB-Demo.mp4
- Both demos with both panes on Fable 5 (plain left, pxpipe right).
- Fable reads what Opus can't. The imaged phrase-count that Opus refuses (see the Opus demo below): the pxpipe arm counts the exact token 10/10 across 39 imaged filler files (matches grep ground truth line-for-line) and gets the multi-step ledger arithmetic right (8037 → … → 15,021).
- Same answers, ~7× cheaper. Session totals after both demos: plain $42.21, context 96% full (964.5k/1M, one task away from forced compaction) vs pxpipe $6.06 with context to spare (73.5k/1M).
- Honest caveat, visible in the clip: the pxpipe arm answered the count first and needed one follow-up nudge to also print the ledger balance in the requested one-line format; the plain arm followed the format on the first try. Legibility is solved on Fable — single-reply format compliance is the remaining rough edge.
Opus 4.8 demo (Opus disabled by default):
Opus-AB-Demo.mp4
Side-by-side — plain Claude (left) vs pxpipe (right), both on Opus 4.8 (opt-in; pxpipe is tuned for Fable — see the Fable clip above). Click the image to watch (Google Drive).
- Demo 1 — fix a failing test suite: both pass; the dashboard shows pxpipe cut the request to a fraction of the tokens (real, server-measured context/token reduction).
- Demo 2 — a big file-context (40 files, ~382k tokens) plus a math question and a "count this phrase" task: the math answer (a small text needle) reads on both. The phrase-count needs reading the imaged filler — so pxpipe-on-Opus can't read it and honestly surfaces that it won't fabricate a number (the documented lossy limit: exact values stay text). Plain, meanwhile, bogs down counting file-by-file.
npx pxpipe-proxy                                    # proxy on 127.0.0.1:47821
ANTHROPIC_BASE_URL=http://localhost:47821 claude    # point Claude Code at it
Open http://127.0.0.1:47821/ for a live dashboard: tokens saved, per-session stats, every text→image conversion side by side, a global kill switch, and runtime model chips including GPT 5.6 and GPT 5.5.
Nothing else changes. Responses stream normally; pxpipe only compresses the request (your context going up), never the model's output. Recent turns stay text; the system prompt, tool docs, and older bulk history are imaged.
It is lossy. pxpipe is a gist tier, not a lossless store. In a needle-in-haystack eval, exact 12-char hex strings inside dense imaged content came back 0/15 on Opus and 13/15 on Fable 5, and the failure mode is silent confabulation: a plausible wrong value, not an error. Anything you need back byte-exact (IDs, hashes, secrets, exact numbers) must stay text. Recent turns do; a dedicated verbatim-risk guard is not built yet.
Exact-recall escape hatch. pxpipe only images Fable requests
(PXPIPE_MODELS=claude-fable-5), so any subagent on a non-Fable model passes
through as text. Route work that needs byte-exact values to one — globally with
CLAUDE_CODE_SUBAGENT_MODEL=claude-sonnet-4-6, or per-agent with model: sonnet
in the agent frontmatter. It reads from source (file/JSONL), not the imaged
history. This covers exact-recall you route on purpose; it does not catch a
silent misread you did not expect — that is the unbuilt guard above.
Does it break real work? Parity in what we measured: a 10-instance SWE-bench Lite pilot (the easy subset) resolved 10/10 on both arms, pxpipe ON at $27 vs OFF at $54 token-equivalent, and 19 SWE-bench Pro pairs (harder, long-horizon) resolved 14/19 ON vs 15/19 OFF at -60% per-request: verdicts agree on 18/19, and the single split (one ON fail) re-resolved 3/3 when replicated, i.e. run-to-run agentic variance, not compression. Small n, details and caveats below.
Savings are workload-dependent. It wins on token-dense content (~1 char/token: code, JSON, hashes) and loses money on sparse English prose (~3.5 chars/token). The built-in gate only images content where the math wins, calibrated against N=391 production rows.
Model scope: one PXPIPE_MODELS CSV controls which model bases get imaged
across both families — default claude-fable-5,gpt-5.6 (GPT 5.5 is opt-in;
it degrades on imaged context). Set
PXPIPE_MODELS=off to disable imaging entirely, or use
~/.config/pxpipe/config.json with { "models": "off" } (or a list). For GPT,
pxpipe keeps tool definitions in native JSON (only verbose schema prose moves
into the image) so tool-calling stays reliable; unlike the Claude path, the GPT
path does not add or depend on Anthropic cache_control prompt-cache markers.
The dashboard chips can flip any model live without changing client configs.
Opus 4.7/4.8 was the original Claude scope but misread ~7% of renders
(10200→9400), so it was turned off by default once Fable 5 hit 100/100 with
identical image billing — opt it back in at your own risk via PXPIPE_MODELS or
the dashboard chips. Everything else passes through untouched.
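
As a rough illustration of the scope rule above, the sketch below parses a PXPIPE_MODELS-style CSV and decides whether a given model ID would be imaged. The helper names (`parseModelScope`, `isImaged`) and the rule that a dated or suffixed model ID matches its base by prefix are assumptions made for illustration, not pxpipe's actual implementation.

```typescript
// Hypothetical helpers: names and matching rule are illustrative assumptions.
type ModelScope = { enabled: boolean; bases: string[] };

// Default quoted in the text above.
const DEFAULT_MODELS = "claude-fable-5,gpt-5.6";

function parseModelScope(raw: string | undefined): ModelScope {
  // Treat an unset or blank value as "use the default list".
  const source = raw === undefined || raw.trim() === "" ? DEFAULT_MODELS : raw;
  const value = source.trim().toLowerCase();
  if (value === "off") return { enabled: false, bases: [] };
  const bases = value
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
  return { enabled: bases.length > 0, bases };
}

// Assumption: an ID matches a base when it equals the base or extends it
// with a "-suffix" (for example a dated variant).
function isImaged(model: string, scope: ModelScope): boolean {
  if (!scope.enabled) return false;
  const id = model.toLowerCase();
  return scope.bases.some((base) => id === base || id.startsWith(base + "-"));
}
```

In a Node process this would typically be fed from `process.env.PXPIPE_MODELS`; anything that does not match falls through as plain text.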
Measured with novel random-number problems the model cannot have memorized:
| test | N | text | pxpipe (image) | tokens |
|---|---|---|---|---|
| novel arithmetic, claude-fable-5 | 100 | 100% | 100% | −38% |
| novel arithmetic, claude-opus-4-8 | 100 | 100% | 93% | −38% |
| gist recall A/B (decisions, values, paths, names, negations; with distractors; 15k-45k char sessions), Fable 5 | 98/arm | 98/98 | 98/98 | - |
| state tracking (value mutated 3x, final/first/count), Fable 5 | 18/arm | 18/18 | 18/18 | - |
| confabulation on never-stated facts (lower is better), Fable 5 | 16/arm | 0/16 | 0/16 | - |
| verbatim 12-char hex recall, dense render, Opus | 15 | 15/15 | 0/15 | - |
| verbatim 12-char hex recall, dense render, Fable 5 | 15 | - | 13/15 | - |
10 SWE-bench Lite instances, Claude Code + Fable 5, paired runs through
pxpipe ON vs OFF, graded with the official swebench Docker harness:
| | pxpipe ON | OFF |
|---|---|---|
| resolved | 10/10 | 10/10 |
| request size vs own uncompressed body | −65% | ±0 |
The −65% is per-request (count_tokens probe of each body before
compression), so it has no turn-count confound. n=10/arm, Lite skews easy.
Run totals, receipts, caveats: eval/swe-bench/.
19 completed pairs across two runs (2 dropped: checkout failed both
arms), same setup, official SWE-bench_Pro-os Docker harness:
| | pxpipe ON | OFF |
|---|---|---|
| resolved | 14/19 | 15/19 |
| request size vs own uncompressed body | −60% | ±0 |
Verdicts agree on 18/19 (four instances failed both arms, one with
byte-identical patches across arms). The single split (navidrome, ON
fail) was replicated 3x on the ON arm: all three runs produced an
identical patch and resolved, so the original loss was run-to-run
agentic variance, not compression. Receipts:
eval/swe-bench-pro/.
We also ran GSM8K: 96% imaged. But GSM8K is in training data, and the model
can recall memorized answers despite its own misreads, which inflates the
score. We lead with the clean novel-number eval instead. Reproduce:
eval/gsm8k/ · eval/needle-haystack/ ·
eval/gist-recall/ ·
full analysis in FINDINGS.md.
Is the headline end-to-end, or only on the requests you touched? End-to-end, the whole bill. Most compression tools report savings only on the input slice they touched, which flatters the number. The end-to-end denominator is every production request: the small ones pxpipe correctly left untouched, all cache writes and reads, and all output tokens (which the proxy never compresses). On a 13,709-request snapshot that was 59% ($100 → ~$41); a later 8,904-compressed-request trace measured ~70%. Compressed-only runs higher (~72–74%) and is quoted separately, never as the headline. The exact figure is workload-dependent — reproduce it on your own log.
How is the math measured?
Both sides of the same request, at the same moment. For every /v1/messages
POST the proxy fires a free count_tokens probe on the original uncompressed
body (the counterfactual) in parallel with the real forward, and reads
Anthropic's actually-billed usage block off the response. Both land in the
same row of ~/.pxpipe/events.jsonl, so there is no turn-count or
run-to-run confound. Dollar conversion uses Fable 5 list ratios: input ×1.0,
cache write ×1.25, cache read ×0.1, output ×5. Cache pricing is applied
identically to both sides, so the caching discount cancels and cannot be
double-counted as "savings". Re-derive it yourself from the events log: the
formula and field names are documented in src/core/baseline.ts.
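
The weighting described above can be sketched as a small function. The `Usage` and `Row` shapes below are hypothetical placeholders, not the real events.jsonl schema (the actual field names live in src/core/baseline.ts and are not reproduced here), and the sketch assumes both sides of each row already arrive split into input, cache-write, cache-read, and output tokens.

```typescript
// List-price ratios quoted above, relative to one input token.
const WEIGHTS = { input: 1.0, cacheWrite: 1.25, cacheRead: 0.1, output: 5 } as const;

// Hypothetical shapes for illustration; not the actual events.jsonl fields.
interface Usage {
  input: number;
  cacheWrite: number;
  cacheRead: number;
  output: number;
}
interface Row {
  actual: Usage;          // what was billed after pxpipe
  counterfactual: Usage;  // what the uncompressed body would have cost
}

// Cost in input-token equivalents.
function weightedCost(u: Usage): number {
  return (
    u.input * WEIGHTS.input +
    u.cacheWrite * WEIGHTS.cacheWrite +
    u.cacheRead * WEIGHTS.cacheRead +
    u.output * WEIGHTS.output
  );
}

// End-to-end savings: every row counts, including requests left untouched.
function endToEndSavings(rows: Row[]): number {
  let actual = 0;
  let baseline = 0;
  for (const r of rows) {
    actual += weightedCost(r.actual);
    baseline += weightedCost(r.counterfactual);
  }
  return baseline === 0 ? 0 : 1 - actual / baseline;
}
```

Because untouched requests contribute identical cost to both sums, they pull the end-to-end figure below the compressed-only figure, which is why the two are quoted separately.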
What does it actually compress? Three kinds of input blocks, each behind a profitability gate:
- large tool_result bodies (file reads, command output, logs) above ~6k chars of token-dense content
- older collapsed history: turns behind the live tail get re-rendered as image pages, recent turns always stay text
- the static system prompt + tool docs slab
Everything else passes through byte-identical: your messages, recent turns, the model's output (it is the response, the proxy never touches it), sparse prose, and anything too small to win. Non-Fable models pass through entirely.
Has it ever failed for real, outside the benchmarks? Yes, once in weeks of daily use: the model recalled a person's name from imaged chat history and got it confidently wrong. No error, just a plausible wrong name. That is the documented failure mode: exact strings in imaged content are not byte-safe. Coding sessions tolerate this because the agent re-reads files before editing; pure chat recall has no such check.
tool_result string ──► wrap at 1928px-wide columns ──► pack ~92,000 chars/page ──► PNG[]
The proxy intercepts /v1/messages, rewrites eligible bulk history into image
blocks, splices them back cache-friendly (static prefix preserved, so prompt
caching keeps working), and forwards. Per-request events log to
~/.pxpipe/events.jsonl.
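
The text-side steps of that pipeline can be approximated in a few lines. This is only a sketch under simplifying assumptions: it counts characters rather than wrapping at a pixel width, it does no rasterization, and the function names `reflow` and `paginate` are invented for illustration rather than taken from pxpipe's source.

```typescript
const NEWLINE_MARK = "↵";      // marker used in renders to show original newlines
const CHARS_PER_PAGE = 92000;  // approximate page capacity quoted above

// Collapse runs of spaces/tabs, drop trailing whitespace, and join the lines
// into one stream with the newline marker between them.
function reflow(text: string): string {
  return text
    .split(/\r?\n/)
    .map((line) => line.replace(/[ \t]+/g, " ").trimEnd())
    .join(NEWLINE_MARK);
}

// Split into pages of at most `limit` code points.
// Array.from iterates by code point, so surrogate pairs are never cut in half.
function paginate(text: string, limit: number = CHARS_PER_PAGE): string[] {
  if (limit <= 0) throw new RangeError("limit must be positive");
  const chars = Array.from(text);
  const pages: string[] = [];
  for (let i = 0; i < chars.length; i += limit) {
    pages.push(chars.slice(i, i + limit).join(""));
  }
  return pages;
}
```

Each page string would then be handed to a renderer; here that step is left out because it depends on the glyph atlas and layout code.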
The economics: a 1928×1928 image costs ≈4,761 vision tokens and holds up to
≈92,000 chars (≈48,000 text tokens at the observed density), so plain text is
cheaper only when it runs denser than ~19 chars/token. Claude Code transcripts
are far below that (observed 1.91 chars/token, N=391). The runtime estimator (estimateImageCount) plus a chars/token gate
decides per-request; sparse prose is left as text.
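
The break-even arithmetic above can be checked with a short sketch. The function names are hypothetical and this is not the real estimateImageCount logic. It also assumes every page is billed as a full 1928×1928 page; the ≈48k-char example rendered at ≈2.7k image tokens suggests partly filled pages cost less, so this version is deliberately conservative.

```typescript
const PAGE_CHARS = 92000;  // approx. chars held by one 1928x1928 page
const PAGE_TOKENS = 4761;  // approx. vision tokens billed for that page

// Chars per text token at which a full page costs the same either way.
function breakEvenCharsPerToken(): number {
  return PAGE_CHARS / PAGE_TOKENS;
}

// Assumption: each page, including a partly filled last one, is billed in full.
function estimateImageTokens(chars: number): number {
  return Math.ceil(chars / PAGE_CHARS) * PAGE_TOKENS;
}

// Imaging only pays off when the image tokens undercut the text tokens.
function imagingWins(chars: number, textTokens: number): boolean {
  return chars > 0 && estimateImageTokens(chars) < textTokens;
}
```

A full page of transcript text at the observed density (about 92,000 chars, about 48,000 text tokens) clears the gate easily, while a 2,000-char snippet costing a few hundred text tokens does not, which matches the "too small to win" pass-through.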
Same engine, no proxy. Render text → PNGs, or run the full cache-safe transform:
import { renderTextToPngs, transformAnthropicMessages } from "pxpipe";

const imgs = await renderTextToPngs(toolResultText); // RenderedImage[]
const { body, applied, info } = await transformAnthropicMessages({
  body: requestBytes,
  model: "claude-fable-5",
});
options.keepSharp(block) pins blocks as text (override the heuristic for IDs,
hashes, paths); options.emitRecoverable returns the originals of imaged blocks
so a stateful caller can recover them — the two halves of the fidelity contract
for the lossy limitation below. Runtime is pure-JS (Node and edge/Workers);
@napi-rs/canvas is build-time only. Full API, types, and constants:
src/core/index.ts.
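
One way a caller might decide what to pin is a simple pattern check for values that must come back byte-exact. The sketch below is a plain string predicate: the helper name `looksVerbatimCritical` and its patterns are illustrative assumptions, not part of pxpipe, and wiring it into options.keepSharp depends on the block type defined in src/core/index.ts, which is not shown here.

```typescript
// Hypothetical heuristic; the patterns are illustrative, not pxpipe's own.
// Patterns are kept free of the `g` flag so RegExp.test stays stateless.
const DEFAULT_PATTERNS: RegExp[] = [
  // Hex runs of 12+ chars (hashes, IDs); this also catches long digit runs.
  /\b[0-9a-f]{12,}\b/i,
  // UUIDs in the usual 8-4-4-4-12 layout.
  /\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b/i,
];

// Returns true when the text contains something that likely needs exact recall.
// `extra` lets a caller add project-specific patterns (pass them without `g`).
function looksVerbatimCritical(text: string, extra: RegExp[] = []): boolean {
  return [...DEFAULT_PATTERNS, ...extra].some((re) => re.test(text));
}
```

A check like this only covers patterns you anticipate. It does not replace the unbuilt verbatim-risk guard, and a name or number that matches no pattern would still be imaged.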
pnpm install && pnpm test   # 376 tests
pnpm run build              # regenerates dist/
- Lossy: see "the honest part" above. Verbatim recall from images is unreliable.
- Render latency: encoding PNGs adds time to large requests before they leave (partly offset by the model ingesting fewer tokens). Responses stream normally.
- ASCII/Latin-1 well tested; CJK works but conservatively.
- Runtime is pure-JS: runs on Node and edge/Workers. @napi-rs/canvas is a build-time-only dev dep (regenerating the glyph atlas), not a runtime dep.
- Fable 5 only.
Everything above is measured. Everything here is not. These are hypotheses, not claims; they ship as numbers with an n or they get cut.
- Sharper glyphs. The 13/15 verbatim gap is partly font legibility, not just the model. A per-char confusion matrix across render styles is paused mid-run (eval/glyph-matrix/); if a zero-cost style lowers read error, the gate compresses harder at the same fidelity.
- Effective context. Dense text is carried in ~3x fewer tokens as images. If that holds in the live window and not just on the bill, 1M tokens holds ~2x the real content. Open question: can a task needing ~2M raw context run inside Fable's 1M once the bulk is imaged?
- Less active text, sharper model. Long contexts degrade reasoning as they fill. Imaging old bulk shrinks what the model actively reads while keeping it reachable. Hypothesis: same information, smaller active context, better long-task accuracy.
One bet: longer effective context and a sharper model on long tasks, from the same Fable 5. Numbers or retraction, no hype between.
MIT.
