Benchmarks¶
This page has moved
Comprehensive benchmark documentation is now at Testing & Benchmarks.
See also: Reproduce Benchmarks | Running Tests
Headline Results¶
Tested with qwen3-4b (4K context window, 2048 token generation limit):
| Metric | Direct LLM | CRP | Multiplier |
|---|---|---|---|
| Words produced | 592 | 6,993 | 11.8× |
| Sections (of 30) | 8 | 25 | 3.1× |
| Conclusion | No | Yes | — |
| Quality tier | — | A | — |
| Continuation windows | — | 9 | — |
Key finding
CRP produces 11.8× more content from the same model and completes the task. The direct LLM truncates at 8/30 sections with no conclusion.
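As a sanity check, the headline multipliers follow directly from the raw counts in the table (a quick sketch; the counts are from the table, the rounding is ours):

```python
# Recompute the headline multipliers from the raw counts above.
direct_words, crp_words = 592, 6993
direct_sections, crp_sections = 8, 25

word_multiplier = round(crp_words / direct_words, 1)
section_multiplier = round(crp_sections / direct_sections, 1)

print(f"{word_multiplier}x words, {section_multiplier}x sections")  # 11.8x words, 3.1x sections
```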
Protocol Efficiency¶
CRP overhead is negligible; 93.9% of wall-clock time is LLM generation:
| Component | Time | % of Total |
|---|---|---|
| LLM generation | — | 93.9% |
| Envelope construction | 4.9–10.2 s | ~5.5% |
| Extraction pipeline | 8–12 ms/window | <0.01% |
| Security checks | ~202 µs/window | <0.001% |
| Total CRP overhead | — | 6.1% |
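The overhead budget can be cross-checked from the figures above (a rough sketch; the 9-window count and 24m 3s wall time are taken from other sections of this page):

```python
# Cross-check the overhead budget against the measured shares.
generation_share = 0.939                  # LLM generation, fraction of wall time
overhead_share = round(1 - generation_share, 3)

total_seconds = 24 * 60 + 3               # 24m 3s CRP wall time (Throughput section)
extraction_total = 9 * 0.012              # 9 windows x 12 ms upper bound, in seconds
extraction_share = extraction_total / total_seconds

print(f"overhead: {overhead_share:.1%}")      # 6.1%
print(f"extraction: {extraction_share:.4%}")  # well under 0.01% of wall time
```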
Envelope Saturation¶
CRP uses virtually all available context space:
| Window | Saturation |
|---|---|
| 1 | 0.939 |
| 2 | 0.998 |
| 3 | 1.002 |
| 4 | 1.008 |
| 5 | 1.021 |
| Mean | 0.994 |
Saturation values >1.0 occur due to compression gains (facts packed more efficiently than raw text).
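The mean saturation in the table can be reproduced from the per-window values:

```python
# Mean saturation across the five measured windows.
saturation = [0.939, 0.998, 1.002, 1.008, 1.021]
mean = round(sum(saturation) / len(saturation), 3)
print(mean)  # 0.994
```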
Throughput¶
Effective throughput is essentially identical, at roughly 4.9 words/second for both methods:
| Metric | Direct LLM | CRP |
|---|---|---|
| Wall time | 2m 1s | 24m 3s |
| Words/sec | 4.9 | 4.9 |
| Total words | 592 | 6,993 |
CRP takes 12× longer because it produces 12× more content at the same generation speed.
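The throughput parity falls out of the wall times and word counts above (a quick sketch; to two decimals the two rates differ by under 0.05 words/second, which the table rounds to 4.9 for both):

```python
# Effective throughput derived from the wall times and word counts above.
direct = 592 / (2 * 60 + 1)    # 592 words over 2m 1s
crp = 6993 / (24 * 60 + 3)     # 6,993 words over 24m 3s

print(f"direct: {direct:.2f} w/s, crp: {crp:.2f} w/s")
```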
Thinking Model Tax¶
qwen3-4b spends about 51% of its tokens on `<think>` reasoning blocks that are invisible in the final output:
| Mode | Thinking tokens | Visible tokens |
|---|---|---|
| Standard | 0% | 100% |
| Thinking model | ~51% | ~49% |
A non-thinking model would need approximately 4–5 windows instead of 9 for the same output volume.
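The 4–5 window estimate follows from scaling the measured 9 windows by the visible-token fraction (a simplifying assumption on our part: windows scale linearly with visible tokens):

```python
import math

# Estimate windows needed if no tokens were spent on <think> blocks.
measured_windows = 9
visible_fraction = 0.49          # ~49% of generated tokens are visible output

estimate = measured_windows * visible_fraction   # 4.41 token-equivalent windows
print(math.floor(estimate), math.ceil(estimate))  # 4 5
```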
Scaling Projections¶
Based on measured degradation characteristics:
| Scale | Tokens | Est. Windows (128K) | Quality Tier | Effective Context |
|---|---|---|---|---|
| 1× | 128K | 1 | S | 100% |
| 10× | 1.3M | 10 | A | >95% |
| 100× | 13M | 100 | B | 80–95% |
| 1,000× | 130M | 1,000 | C | 60–80% |
| 10,000× | 1.3B | 10,000 | D | <60% (hierarchical: 73%) |
Hierarchical dispatch
At 1B+ tokens, serial chaining has ~0% effective context. Hierarchical dispatch preserves 73% by using $O(\log N)$ window levels instead of an $O(N)$ chain.
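The window-count difference can be sketched with a toy model (the branching factor of 10 is an illustrative assumption, not a measured CRP parameter):

```python
import math

def serial_windows(n_windows: int) -> int:
    """Serial chaining visits every window in a single O(N) chain."""
    return n_windows

def hierarchical_levels(n_windows: int, branching: int = 10) -> int:
    """Dispatch levels when each node fans out to `branching` children: O(log N)."""
    levels = 0
    while n_windows > 1:
        n_windows = math.ceil(n_windows / branching)
        levels += 1
    return levels

# At the 10,000x scale row above (10,000 windows):
print(serial_windows(10_000), hierarchical_levels(10_000))  # 10000 4
```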
Running Your Own Benchmarks¶
```bash
# Using the demo app with mock provider
python -m examples.demo_app.demo compare --mock

# With a real provider
python -m examples.demo_app.demo compare --provider openai --model gpt-4o

# Full benchmark suite (all strategies)
python -m examples.demo_app.demo full --mock
```
Benchmark Methodology¶
- Task: 30-section technical guide (Kubernetes for Enterprise)
- Model: qwen3-4b via Ollama
- Context window: 4,096 tokens
- Generation limit: 2,048 tokens
- Measurement: Wall-clock time, token counts, output analysis
- Environment: Consumer hardware (no GPU cluster)
All benchmarks are reproducible. See BENCHMARKS.md for the full methodology and raw data.