Benchmarks

This page has moved

Comprehensive benchmark documentation is now at Testing & Benchmarks.

See also: Reproduce Benchmarks | Running Tests


Headline Results

Tested with qwen3-4b (4K context window, 2048 token generation limit):

| Metric | Direct LLM | CRP | Multiplier |
|---|---|---|---|
| Words produced | 592 | 6,993 | 11.8× |
| Sections (of 30) | 8 | 25 | 3.1× |
| Conclusion | No | Yes | — |
| Quality tier | — | A | — |
| Continuation windows | — | 9 | — |

Key finding

CRP produces 11.8× more content from the same model and completes the task. The direct LLM truncates at 8/30 sections with no conclusion.
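The multiplier is simply the ratio of the two word counts; a quick sanity check in Python:

```python
# Headline word counts from the table above.
direct_words = 592
crp_words = 6_993

multiplier = crp_words / direct_words
print(f"{multiplier:.1f}x")  # 11.8x
```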

Protocol Efficiency

CRP overhead is negligible — 93.9% of wall-clock time is LLM generation:

| Component | Time | % of Total |
|---|---|---|
| LLM generation | — | 93.9% |
| Envelope construction | 4.9–10.2 s | ~5.5% |
| Extraction pipeline | 8–12 ms/window | <0.01% |
| Security checks | ~202 μs/window | <0.001% |
| Total CRP overhead | — | 6.1% |

Envelope Saturation

CRP uses virtually all available context space:

| Window | Saturation |
|---|---|
| 1 | 0.939 |
| 2 | 0.998 |
| 3 | 1.002 |
| 4 | 1.008 |
| 5 | 1.021 |
| Mean | 0.994 |

Saturation values >1.0 occur due to compression gains (facts packed more efficiently than raw text).
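Saturation here is the ratio of packed envelope tokens to the available context budget. A minimal sketch of that arithmetic; the function and the token counts are illustrative, not CRP's actual implementation:

```python
def saturation(packed_tokens: int, context_budget: int) -> float:
    """Fraction of the context budget occupied by the packed envelope.

    Values above 1.0 mean compression fit more source content than the
    raw budget would hold verbatim.
    """
    return packed_tokens / context_budget

# Illustrative numbers against a 4,096-token budget (matches window 3's 1.002).
print(round(saturation(4_104, 4_096), 3))  # 1.002
```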

Throughput

Effective throughput is identical — 4.9 words/second for both methods:

| Metric | Direct LLM | CRP |
|---|---|---|
| Wall time | 2m 1s | 24m 3s |
| Words/sec | 4.9 | 4.9 |
| Total words | 592 | 6,993 |

CRP takes 12× longer because it produces 12× more content at the same generation speed.
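Both throughput figures can be recomputed from the wall times in the table; to two decimal places they differ only in rounding:

```python
# Word counts and wall times converted to seconds (2m 1s and 24m 3s).
runs = {
    "Direct LLM": (592, 2 * 60 + 1),
    "CRP": (6_993, 24 * 60 + 3),
}

for name, (words, secs) in runs.items():
    print(f"{name}: {words / secs:.2f} words/sec")
# Direct LLM: 4.89 words/sec
# CRP: 4.85 words/sec
```

Both values round to the quoted 4.9 words/second.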

Thinking Model Tax

qwen3-4b spends 51% of tokens on `<think>` reasoning blocks that are invisible in the final output:

| Mode | Thinking tokens | Visible tokens |
|---|---|---|
| Standard | 0% | 100% |
| Thinking model | ~51% | ~49% |

A non-thinking model would need approximately 4–5 windows instead of 9 for the same output volume.
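The 4–5 window estimate follows from the visible-token fraction: if only ~49% of generated tokens reach the output, a model that spends its entire budget on visible tokens needs proportionally fewer continuation windows. A sketch of that arithmetic:

```python
import math

windows_with_thinking = 9
visible_fraction = 0.49  # ~49% of tokens are visible output

# Scale the window count by the visible fraction: same visible output,
# no budget spent on <think> blocks.
estimate = windows_with_thinking * visible_fraction
print(f"{math.floor(estimate)}-{math.ceil(estimate)} windows")  # 4-5 windows
```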

Scaling Projections

Based on measured degradation characteristics:

| Scale | Tokens | Est. Windows (128K) | Quality Tier | Effective Context |
|---|---|---|---|---|
| 1× | 128K | 1 | S | 100% |
| 10× | 1.3M | 10 | A | >95% |
| 100× | 13M | 100 | B | 80–95% |
| 1,000× | 130M | 1,000 | C | 60–80% |
| 10,000× | 1.3B | 10,000 | D | <60% (hierarchical: 73%) |

Hierarchical dispatch

At 1B+ tokens, serial chaining has ~0% effective context. Hierarchical dispatch preserves 73% — $O(\log N)$ windows instead of $O(N)$.
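A back-of-the-envelope illustration of the scaling difference, assuming 128K-token windows and a fan-out of 10 children per dispatch node (the fan-out is an assumption for illustration, not a measured CRP parameter). Serial chaining forms a chain as long as the chunk count, while hierarchical dispatch bounds the longest chain by the tree depth:

```python
import math

WINDOW = 128_000

def serial_chain(total_tokens: int) -> int:
    # Serial chaining: every 128K chunk extends the chain by one window.
    return math.ceil(total_tokens / WINDOW)

def hierarchical_chain(total_tokens: int, fanout: int = 10) -> int:
    # Hierarchical dispatch: the longest chain is the dispatch-tree depth,
    # ceil(log_fanout(chunks)).
    chunks = math.ceil(total_tokens / WINDOW)
    return max(1, math.ceil(math.log(chunks, fanout)))

print(serial_chain(1_300_000_000))        # 10157 windows in the O(N) chain
print(hierarchical_chain(1_300_000_000))  # 5 windows in the O(log N) chain
```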

Running Your Own Benchmarks

```shell
# Using the demo app with mock provider
python -m examples.demo_app.demo compare --mock

# With a real provider
python -m examples.demo_app.demo compare --provider openai --model gpt-4o

# Full benchmark suite (all strategies)
python -m examples.demo_app.demo full --mock
```

Benchmark Methodology

  • Task: 30-section technical guide (Kubernetes for Enterprise)
  • Model: qwen3-4b via Ollama
  • Context window: 4,096 tokens
  • Generation limit: 2,048 tokens
  • Measurement: Wall-clock time, token counts, output analysis
  • Environment: Consumer hardware (no GPU cluster)

All benchmarks are reproducible. See BENCHMARKS.md for the full methodology and raw data.