Benchmarks¶
This page has moved
Comprehensive benchmark documentation is now at Testing & Benchmarks.
See also: Reproduce Benchmarks | Running Tests
Headline Results¶
Tested with qwen3-4b (4K context window, 2048 token generation limit):
| Metric | Direct LLM | CRP | Multiplier |
|---|---|---|---|
| Words produced | 592 | 6,993 | 11.8× |
| Sections (of 30) | 8 | 25 | 3.1× |
| Conclusion | No | Yes | — |
| Quality tier | — | A | — |
| Continuation windows | — | 9 | — |
Key finding
CRP produces 11.8× more content from the same model and completes the task. The direct LLM truncates at 8/30 sections with no conclusion.
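As a sanity check, the headline multipliers follow directly from the raw counts in the table (a quick sketch; the counts are from the table, the rounding is ours):

```python
# Recompute the headline multipliers from the raw counts above.
direct_words, crp_words = 592, 6993
direct_sections, crp_sections = 8, 25

word_multiplier = round(crp_words / direct_words, 1)
section_multiplier = round(crp_sections / direct_sections, 1)

print(f"{word_multiplier}x words, {section_multiplier}x sections")  # 11.8x words, 3.1x sections
```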
Protocol Efficiency¶
CRP overhead is negligible; 93.9% of wall-clock time is LLM generation:
| Component | Time | % of Total |
|---|---|---|
| LLM generation | — | 93.9% |
| Envelope construction | 4.9–10.2 s | ~5.5% |
| Extraction pipeline | 8–12 ms/window | <0.01% |
| Security checks | ~202 µs/window | <0.001% |
| Total CRP overhead | — | 6.1% |
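The overhead budget can be cross-checked from the figures above (a rough sketch; the 9-window count and 24m 3s wall time are taken from other sections of this page):

```python
# Cross-check the overhead budget against the measured shares.
generation_share = 0.939                  # LLM generation, fraction of wall time
overhead_share = round(1 - generation_share, 3)

total_seconds = 24 * 60 + 3               # 24m 3s CRP wall time (Throughput section)
extraction_total = 9 * 0.012              # 9 windows x 12 ms upper bound, in seconds
extraction_share = extraction_total / total_seconds

print(f"overhead: {overhead_share:.1%}")      # 6.1%
print(f"extraction: {extraction_share:.4%}")  # well under 0.01% of wall time
```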
Envelope Saturation¶
CRP uses virtually all available context space:
| Window | Saturation |
|---|---|
| 1 | 0.939 |
| 2 | 0.998 |
| 3 | 1.002 |
| 4 | 1.008 |
| 5 | 1.021 |
| Mean | 0.994 |
Saturation values >1.0 occur due to compression gains (facts packed more efficiently than raw text).
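The mean saturation in the table can be reproduced from the per-window values:

```python
# Mean saturation across the five measured windows.
saturation = [0.939, 0.998, 1.002, 1.008, 1.021]
mean = round(sum(saturation) / len(saturation), 3)
print(mean)  # 0.994
```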
Throughput¶
Effective throughput is essentially identical, at roughly 4.9 words/second for both methods:
| Metric | Direct LLM | CRP |
|---|---|---|
| Wall time | 2m 1s | 24m 3s |
| Words/sec | 4.9 | 4.9 |
| Total words | 592 | 6,993 |
CRP takes 12× longer because it produces 12× more content at the same generation speed.
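The throughput parity falls out of the wall times and word counts above (a quick sketch; to two decimals the two rates differ by under 0.05 words/second, which the table rounds to 4.9 for both):

```python
# Effective throughput derived from the wall times and word counts above.
direct = 592 / (2 * 60 + 1)    # 592 words over 2m 1s
crp = 6993 / (24 * 60 + 3)     # 6,993 words over 24m 3s

print(f"direct: {direct:.2f} w/s, crp: {crp:.2f} w/s")
```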
Thinking Model Tax¶
qwen3-4b spends about 51% of its tokens on `<think>` reasoning blocks that are invisible in the final output:
| Mode | Thinking tokens | Visible tokens |
|---|---|---|
| Standard | 0% | 100% |
| Thinking model | ~51% | ~49% |
A non-thinking model would need approximately 4–5 windows instead of 9 for the same output volume.
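The 4–5 window estimate follows from scaling the measured 9 windows by the visible-token fraction (a simplifying assumption on our part: windows scale linearly with visible tokens):

```python
import math

# Estimate windows needed if no tokens were spent on <think> blocks.
measured_windows = 9
visible_fraction = 0.49          # ~49% of generated tokens are visible output

estimate = measured_windows * visible_fraction   # 4.41 token-equivalent windows
print(math.floor(estimate), math.ceil(estimate))  # 4 5
```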
Scaling Projections¶
Based on measured degradation characteristics:
| Scale | Tokens | Est. Windows (128K) | Quality Tier | Effective Context |
|---|---|---|---|---|
| 1× | 128K | 1 | S | 100% |
| 10× | 1.3M | 10 | A | >95% |
| 100× | 13M | 100 | B | 80–95% |
| 1,000× | 130M | 1,000 | C | 60–80% |
| 10,000× | 1.3B | 10,000 | D | <60% (hierarchical: 73%) |
Hierarchical dispatch
At 1B+ tokens, serial chaining has ~0% effective context. Hierarchical dispatch preserves 73% by using $O(\log N)$ window levels instead of an $O(N)$ chain.
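The window-count difference can be sketched with a toy model (the branching factor of 10 is an illustrative assumption, not a measured CRP parameter):

```python
import math

def serial_windows(n_windows: int) -> int:
    """Serial chaining visits every window in a single O(N) chain."""
    return n_windows

def hierarchical_levels(n_windows: int, branching: int = 10) -> int:
    """Dispatch levels when each node fans out to `branching` children: O(log N)."""
    levels = 0
    while n_windows > 1:
        n_windows = math.ceil(n_windows / branching)
        levels += 1
    return levels

# At the 10,000x scale row above (10,000 windows):
print(serial_windows(10_000), hierarchical_levels(10_000))  # 10000 4
```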
Running Your Own Benchmarks¶
```bash
# Using the demo app with mock provider
python -m examples.demo_app.demo compare --mock

# With a real provider
python -m examples.demo_app.demo compare --provider openai --model gpt-4o

# Full benchmark suite (all strategies)
python -m examples.demo_app.demo full --mock
```
Benchmark Methodology¶
- Task: 30-section technical guide (Kubernetes for Enterprise)
- Model: qwen3-4b via Ollama
- Context window: 4,096 tokens
- Generation limit: 2,048 tokens
- Measurement: Wall-clock time, token counts, output analysis
- Environment: Consumer hardware (no GPU cluster)
All benchmarks are reproducible. See BENCHMARKS.md for the full methodology and raw data.