
Benchmarks

CRP's benchmark suite measures real-world performance of the continuation engine, extraction pipeline, and protocol overhead.

Headline Results

At a 2,048-token generation limit, a single LLM call produces 592 words and is cut off after 8 of 30 requested sections. CRP produces 6,993 words across 25 sections with a proper conclusion — an 11.8x content multiplier at only 6.1% protocol overhead.

| Metric             | Direct LLM | CRP-Orchestrated | Multiplier |
|--------------------|------------|------------------|------------|
| Words              | 592        | 6,993            | 11.8x      |
| Characters         | 4,640      | 52,740           | 11.4x      |
| Sections (of 30)   | 8          | 25               | 3.1x       |
| Paragraphs         | 8          | 99               | 12.4x      |
| Conclusion present | No         | Yes              | —          |
| Truncated          | Yes        | No               | —          |
| Quality tier       | —          | A                | —          |

Model: qwen3-4b (4B parameter thinking model). Hardware: consumer PC, LM Studio, n_ctx = 4,096.

What This Benchmark Measures

When an LLM hits its output-token wall, what happens with and without CRP?

The test gives a model a task requiring ~20,000 tokens (a 30-section technical document), but limits each call to 2,048 output tokens. Without CRP the output is truncated. With CRP the continuation engine detects the wall, extracts facts, packs them into an envelope, and dispatches continuation windows until done.

This is NOT a latency benchmark. CRP trades wall-clock time for task completion — the same way a human re-reads notes before continuing a long essay.
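The detect → extract → pack → dispatch cycle can be sketched as follows. All names here (`hit_token_wall`, `extract_facts`, `pack_envelope`) are illustrative placeholders with toy bodies, not CRP's actual API — the real engine ranks facts and compresses them to a token budget.

```python
# Illustrative sketch of a CRP-style continuation loop.
# The helper implementations below are toy stand-ins.

def hit_token_wall(chunk: str, limit: int = 8) -> bool:
    # Toy heuristic: a chunk that fills the (word) limit was truncated.
    return len(chunk.split()) >= limit

def extract_facts(chunk: str) -> list[str]:
    # Toy extraction: keep the first few tokens as "facts".
    return chunk.split()[:3]

def pack_envelope(facts: list[str]) -> str:
    # Toy packing: in CRP this step ranks and compresses to a budget.
    return "CONTEXT: " + ", ".join(facts) + "\n"

def orchestrate(task: str, llm, max_windows: int = 10) -> str:
    facts: list[str] = []
    chunks: list[str] = []
    envelope = ""
    for _ in range(max_windows):
        chunk = llm(envelope + task)        # one bounded generation window
        chunks.append(chunk)
        if not hit_token_wall(chunk):       # model finished before the wall
            break
        facts.extend(extract_facts(chunk))  # cheap per-window extraction
        envelope = pack_envelope(facts)     # carry facts into next window
    return " ".join(chunks)
```

A fake `llm` that returns one truncated chunk and then a short closing chunk exercises the loop end to end.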

Protocol Efficiency

CRP adds minimal overhead on top of raw LLM generation:

| Component                               | Time      | % of Total |
|-----------------------------------------|-----------|------------|
| LLM generation (9 windows)              | 1,342.4 s | 93.9%      |
| Envelope build (fact ranking + packing) | 66.2 s    | 4.6%       |
| Orchestration logic                     | 20.3 s    | 1.4%       |
| Extraction pipeline                     | 0.125 s   | 0.009%     |
| Total CRP overhead                      | 86.6 s    | 6.1%       |

The extraction pipeline runs in 6–12 ms per window. Envelope build grows linearly as the fact base expands — this is the dominant CRP cost.
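As a sanity check, the component times above combine into the quoted overhead figures (plain arithmetic on the table's own values):

```python
# Component times from the protocol-efficiency table, in seconds.
llm_generation = 1342.4   # 9 windows of raw generation
envelope_build = 66.2     # fact ranking + packing
orchestration  = 20.3     # dispatch logic
extraction     = 0.125    # total across all windows

overhead = envelope_build + orchestration + extraction
total = llm_generation + overhead
print(f"overhead: {overhead:.1f} s ({100 * overhead / total:.1f}% of total)")
```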

Per-Window Telemetry

Each continuation window operates independently:

| Window | LLM (s) | Extract (ms) | Envelope (s) | Output Tokens | Reasoning Tokens | Facts | Saturation |
|--------|---------|--------------|--------------|---------------|------------------|-------|------------|
| 1      | 133.9   | 8            | 4.9          | 1,064         | 955              | 38    | 1.011      |
| 2      | 132.9   | 11           | 5.7          | 1,088         | 921              | 43    | 0.939      |
| 3      | 131.9   | 12           | 5.0          | 1,104         | 901              | 36    | 0.997      |
| 4      | 132.2   | 8            | 6.4          | 826           | 1,184            | 33    | 0.961      |
| 5      | 133.3   | 9            | 7.7          | 1,181         | 827              | 44    | 1.008      |
| 6      | 136.8   | 8            | 9.3          | 637           | 1,314            | 26    | 1.021      |
| 7      | 165.6   | 6            | 7.5          | 869           | 1,141            | 32    | 1.014      |
| 8      | 127.7   | 8            | 9.6          | 932           | 1,075            | 33    | 1.007      |
| 9      | 134.8   | 8            | 10.2         | 1,010         | 998              | 43    | 0.992      |

Key observations:

  • LLM time is stable (~133 s/window) — CRP does not introduce growing cost
  • Extraction is free — 6–12 ms per window (< 0.01% of window time)
  • Envelope build grows linearly — more facts → slightly longer ranking

Envelope Saturation

$$\text{saturation} = \frac{\text{envelope tokens packed}}{\text{envelope budget}}$$

Observed range: 0.939 – 1.021 (mean: 0.994)

Near 1.0 means the budget formula is well-calibrated. Each window packs 9–15 compressed facts from a pool of 26–44 extracted per window.
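The quoted range and mean can be reproduced directly from the saturation column of the per-window telemetry:

```python
# Saturation values for the 9 windows, from the telemetry table above.
saturations = [1.011, 0.939, 0.997, 0.961, 1.008, 1.021, 1.014, 1.007, 0.992]

def saturation(packed_tokens: int, budget: int) -> float:
    """Fraction of the envelope budget actually consumed by packed facts."""
    return packed_tokens / budget

mean = sum(saturations) / len(saturations)
print(f"range: {min(saturations)} - {max(saturations)}, mean: {mean:.3f}")
# range: 0.939 - 1.021, mean: 0.994
```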

Thinking-Model Tax

The qwen3-4b model spends ~51% of its token budget on internal <think> reasoning that never appears in the output:

| Token Type         | Count  | % of Generation |
|--------------------|--------|-----------------|
| Output (visible)   | 9,836  | 49.1%           |
| Reasoning (<think>)| 10,219 | 50.9%           |

A non-thinking model would produce ~2x more content per window, needing ~4–5 windows instead of 9.
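The tax is plain arithmetic on the table's counts:

```python
# Token counts from the thinking-model-tax table.
output_tokens = 9_836       # visible content
reasoning_tokens = 10_219   # hidden <think> traces

tax = reasoning_tokens / (output_tokens + reasoning_tokens)
print(f"reasoning tax: {tax:.1%}")  # roughly half the generation budget
```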

Throughput Parity

| Metric          | Direct LLM  | CRP         |
|-----------------|-------------|-------------|
| Time            | 119.7 s     | 1,429.1 s   |
| Content         | 592 words   | 6,993 words |
| Throughput      | 4.9 words/s | 4.9 words/s |
| Task completed? | No          | Yes         |

CRP takes 12x longer but produces 12x more content — effective throughput is identical. The difference: CRP finishes the task.
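Throughput parity falls straight out of the table's numbers:

```python
# Words and wall-clock times from the throughput-parity table.
direct_words, direct_time = 592, 119.7
crp_words, crp_time = 6_993, 1_429.1

direct_tp = direct_words / direct_time
crp_tp = crp_words / crp_time
print(f"direct: {direct_tp:.1f} words/s, CRP: {crp_tp:.1f} words/s")
# direct: 4.9 words/s, CRP: 4.9 words/s
```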

Performance Regression Tests

test_benchmarks.py contains 12 tests with specific targets:

| Benchmark         | Target          |
|-------------------|-----------------|
| Cold session init | < 200 ms        |
| Warm session init | < 50 ms         |
| Dispatch overhead | < 100 ms        |
| Envelope assembly | < 50 ms         |
| Ingest throughput | > 100 facts/sec |
| Cache hit         | < 1 ms          |
| Event emission    | < 5 ms          |
| Metrics export    | < 10 ms         |

Run them:

python -m pytest tests/test_benchmarks.py -v
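For illustration, a target like "cold session init < 200 ms" might be asserted as in the sketch below. `FakeSession` and the threshold check are stand-ins, not the actual contents of test_benchmarks.py:

```python
import time

class FakeSession:
    """Hypothetical stand-in for the real CRP session object."""
    def init(self):
        time.sleep(0.01)  # simulate cold-start work

def test_cold_session_init_under_200ms():
    session = FakeSession()
    start = time.perf_counter()
    session.init()
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Regression guard: fail loudly if cold init drifts past the target.
    assert elapsed_ms < 200, f"cold init took {elapsed_ms:.1f} ms"
```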

Additional Benchmarks

| Benchmark                | What It Shows                                      |
|--------------------------|----------------------------------------------------|
| Streaming latency        | Time-to-first-token for dispatch_stream()          |
| Session persistence      | Knowledge accumulation across dispatch() calls     |
| Multi-agent pipeline     | 3 agents sharing one CRP session                   |
| Security hardening       | Injection detection, integrity verification        |
| Model scaling            | Same task across 1B, 4B, 7B, 13B models            |
| Non-thinking vs thinking | Expected 2x throughput gain without <think>        |
| Token budget sweep       | 512/1024/2048/4096 tokens, plot windows vs content |

See Reproduce Benchmarks for how to run these yourself.