On a Pro Max 5x (Opus) plan, quota resets at a fixed interval. After a reset, moderate usage (mostly Q&A, light development) exhausted the quota within 1.5 hours. Before that reset, 5 hours of heavy development (multi-file implementation, graphify pipeline, multi-agent spawns) had consumed the previous quota window.
Investigation points to a likely root cause: cache_read tokens appear to count at full rate against the rate limit, negating prompt caching's cost benefit for quota purposes.
All data extracted from `~/.claude/projects/*/*.jsonl` session files, specifically the `usage` object on each API response:

```json
{
  "cache_read_input_tokens": ...,
  "cache_creation_input_tokens": ...,
  "input_tokens": ...,
  "output_tokens": ...
}
```

Window 1 (previous quota window, ~5 hours of heavy development):

| Metric | Value |
|---|---|
| API calls | 2,715 |
| Cache read | 1,044M tokens |
| Cache create | 16.8M tokens |
| Input tokens | 8.9k tokens |
| Output tokens | 1.15M tokens |
| Peak context | 966,078 tokens |
| Effective input (cache_read at 1/10) | 121.8M tokens |
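The per-session totals above can be reproduced with a short aggregation script. This is a sketch: the exact nesting of the `usage` object inside each JSONL record (`message.usage` here) is an assumption and may differ across Claude Code versions.

```python
# Sketch of the aggregation behind the tables in this report.
# Assumption: each .jsonl line is a JSON record whose usage object
# lives at message.usage (the exact nesting may vary by version).
import json
from collections import Counter
from pathlib import Path

USAGE_KEYS = ("input_tokens", "output_tokens",
              "cache_read_input_tokens", "cache_creation_input_tokens")

def sum_usage(jsonl_path):
    """Return (api_calls, per-key token totals) for one session file."""
    totals = Counter()
    calls = 0
    for line in Path(jsonl_path).read_text().splitlines():
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines
        usage = (record.get("message") or {}).get("usage")
        if not usage:
            continue  # not an API response record (e.g. summary, user turn)
        calls += 1
        for key in USAGE_KEYS:
            totals[key] += usage.get(key, 0)
    return calls, dict(totals)
```

Running this over every file under `~/.claude/projects/` and grouping by session yields the per-session rows below.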
Workload: Full feature implementation (Express server + iOS app), graphify knowledge graph pipeline, SPEC-driven multi-agent coordination. 2 auto-compacts as context hit ~960k.
Window 2 (current window, ~1.5 hours after reset). Main session (vibehq):
| Metric | Value |
|---|---|
| API calls | 222 |
| Cache read | 23.2M tokens |
| Cache create | 1.4M tokens |
| Input tokens | 304 tokens |
| Output tokens | 91k tokens |
| Peak context | 182,302 tokens |
| Effective input (cache_read at 1/10) | 2.8M tokens |
Other sessions running (background, not actively used by user):
| Session | API Calls | Cache Read | Eff Input | Output |
|---|---|---|---|---|
| token-analysis | 296 | 57.6M | 6.5M | 145k |
| career-ops | 173 | 23.1M | 3.8M | 148k |
| Total (all sessions) | 691 | 103.9M | 13.1M | 387k |
If cache_read counts at a 1/10 rate: Window 2 total is 13.1M effective tokens in 1.5 hours = 8.7M/hr. This should NOT exhaust a Pro Max 5x quota; for comparison, Window 1 consumed 24.4M effective tokens/hour over a full 5-hour window.

If cache_read counts at full rate: Window 2 total is 103.9M + 1.4M + 387k = 105.7M tokens in 1.5 hours = 70.5M/hr. This would explain the quota exhaustion, but it means prompt caching provides zero benefit for rate limiting.
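The two quota-accounting hypotheses (cache_read discounted 10x vs. counted at full rate) can be checked with back-of-envelope arithmetic over the Window 2 totals. The 1/10 discount mirrors the API's cache-read pricing; lumping output tokens in uniformly is a simplification, so the discounted figure lands near, but not exactly at, the effective number reported above.

```python
# Window 2 totals from the tables above, in tokens.
cache_read   = 103.9e6
cache_create = 1.4e6
output       = 387e3   # raw input tokens are negligible (hundreds)
hours        = 1.5

# Hypothesis A: cache_read discounted 10x for quota (matches pricing).
discounted = cache_read / 10 + cache_create + output
# Hypothesis B: cache_read counted at full rate.
full_rate = cache_read + cache_create + output

print(f"discounted: {discounted / hours / 1e6:.1f}M tokens/hr")  # ~8.1M/hr
print(f"full rate:  {full_rate / hours / 1e6:.1f}M tokens/hr")   # ~70.5M/hr
```

Only the full-rate figure is anywhere near fast enough to drain a heavy-usage quota window in 1.5 hours.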
The session file shows context growing and compacting cyclically:
Segment 1: 32k → 783k (835 calls) → auto-compact
Segment 2: 39k → 966k (1,842 calls) → auto-compact
Segment 3: 55k → 182k (222 calls) → still active
Each API call sends the full context as input. With a 1M context window, calls near the compact threshold send ~960k tokens each. Even with prompt caching, if cache_read counts at full rate against quota, a single call costs ~960k quota tokens.
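If every call's full context counts against quota, a segment's total cost can be approximated by assuming roughly linear context growth between compacts. This is a simplifying assumption (real growth is bursty), but it shows the order of magnitude:

```python
def segment_cost(start_ctx, end_ctx, calls):
    """Approximate context tokens sent across one growth segment,
    assuming context grows roughly linearly call-to-call."""
    return (start_ctx + end_ctx) / 2 * calls

# Segments observed in the session file above:
seg1 = segment_cost(32e3, 783e3, 835)     # ~340M tokens
seg2 = segment_cost(39e3, 966e3, 1_842)   # ~926M tokens
print(f"total: {(seg1 + seg2) / 1e6:.0f}M tokens")  # ~1,266M
```

The two completed segments alone sum to roughly 1.27B tokens sent as context, the same order as the ~1,044M cache_read total in the Window 1 table, consistent with nearly all of that context being served from cache.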
Expected: cache_read tokens should count at reduced rate (1/10) against rate limits, matching the reduced cost.
Observed: Quota exhaustion rate is consistent with cache_read counting at full rate.
Impact: On a 1M context window, each API call sends ~100k-960k tokens. With 200+ calls per hour (normal for tool-heavy Claude Code usage), quota depletes in minutes regardless of caching.
Sessions left open in other terminals continue making API calls (compacts, retros, hook processing) even when the user is not actively interacting. These consume from the same quota pool.
In this case, token-analysis (296 calls) and career-ops (173 calls) were running without active user interaction but still consuming significant quota.
Each auto-compact event results in one API call with the full pre-compact context (~966k tokens) as cache_creation, followed by a fresh start. This means the most expensive single call happens automatically, without user action.
Larger context window = more tokens per call = faster quota depletion. The 1M window is marketed as a feature but becomes counterproductive when cache_read tokens count at full rate against quota.
`~/.claude/rules/` with ~30 rule files adds ~19k tokens of fixed overhead to every call's context (visible via the `/context` command). Token usage should not be consumed at this speed.
It's hard to reproduce, but I can provide the logs.
None
Yes, this worked in a previous version
No response
v2.1.97
Anthropic API
Ubuntu/Debian Linux
WSL (Windows Subsystem for Linux)
No response