Benchmarks¶
CRP's benchmark suite measures real-world performance of the continuation engine, extraction pipeline, and protocol overhead.
Headline Results¶
At a 2,048-token generation limit, a single LLM call produces 592 words and is cut off after 8 of 30 requested sections. CRP produces 6,993 words across 25 sections with a proper conclusion — an 11.8x content multiplier at only 6.1% protocol overhead.
| Metric | Direct LLM | CRP-Orchestrated | Multiplier |
|---|---|---|---|
| Words | 592 | 6,993 | 11.8x |
| Characters | 4,640 | 52,740 | 11.4x |
| Sections (of 30) | 8 | 25 | 3.1x |
| Paragraphs | 8 | 99 | 12.4x |
| Conclusion present | No | Yes | — |
| Truncated | Yes | No | — |
| Quality tier | — | A | — |
Model: qwen3-4b (4B parameter thinking model). Hardware: consumer PC, LM Studio, n_ctx = 4,096.
What This Benchmark Measures¶
When an LLM hits its output-token wall, what happens with and without CRP?
The test gives a model a task requiring ~20,000 tokens (a 30-section technical document), but limits each call to 2,048 output tokens. Without CRP the output is truncated. With CRP the continuation engine detects the wall, extracts facts, packs them into an envelope, and dispatches continuation windows until done.
This is NOT a latency benchmark. CRP trades wall-clock time for task completion — the same way a human re-reads notes before continuing a long essay.
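The detect–extract–pack–dispatch loop described above can be sketched in a few lines. This is an illustrative Python sketch, not the real CRP API: `make_stub_llm`, `extract_facts`, `build_envelope`, and `generate_with_continuation` are all hypothetical stand-ins.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    truncated: bool  # did this call hit the output-token wall?

def make_stub_llm(sections):
    """Stub model: emits one section per call, truncated until the last."""
    it = iter(sections)
    def call(prompt, max_tokens):
        text = next(it)
        return Chunk(text, truncated=(text is not sections[-1]))
    return call

def extract_facts(text):
    # Placeholder extractor: one "fact" per sentence.
    return [s.strip() for s in text.split(".") if s.strip()]

def build_envelope(facts, budget):
    # Placeholder packer: keep the most recent facts within a size budget.
    return " | ".join(facts[-budget:])

def generate_with_continuation(call_llm, task, max_windows=16):
    facts, output = [], []
    prompt = task
    for _ in range(max_windows):
        chunk = call_llm(prompt, max_tokens=2048)
        output.append(chunk.text)
        if not chunk.truncated:
            break                                   # finished before the wall
        facts.extend(extract_facts(chunk.text))     # extraction step
        envelope = build_envelope(facts, budget=9)  # rank + pack facts
        prompt = envelope + "\nContinue from where you stopped."
    return "".join(output)

sections = [f"Section {i}. " for i in range(1, 6)]
doc = generate_with_continuation(make_stub_llm(sections), "Write 5 sections.")
print(doc.count("Section"))  # 5 — every section survives the per-call cutoff
```

Each iteration corresponds to one "window" in the telemetry tables below; the real engine also decides when the document is complete rather than relying solely on the truncation flag.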
Protocol Efficiency¶
CRP adds minimal overhead on top of raw LLM generation:
| Component | Time | % of Total |
|---|---|---|
| LLM generation (9 windows) | 1,342.4 s | 93.9% |
| Envelope build (fact ranking + packing) | 66.2 s | 4.6% |
| Orchestration logic | 20.3 s | 1.4% |
| Extraction pipeline | 0.125 s | 0.009% |
| Total CRP overhead | 86.6 s | 6.1% |
The extraction pipeline runs in 8–12 ms per window. Envelope build grows linearly as the fact base expands — this is the dominant CRP cost.
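As a sanity check, the overhead figures in the table compose as reported — a minimal sketch recomputing them:

```python
# Recompute the CRP overhead totals from the component timings above.
llm_s = 1342.4
envelope_s, orchestration_s, extraction_s = 66.2, 20.3, 0.125

overhead_s = envelope_s + orchestration_s + extraction_s
total_s = llm_s + overhead_s

print(round(overhead_s, 1))                  # 86.6 s of CRP overhead
print(round(100 * overhead_s / total_s, 1))  # 6.1 (% of total runtime)
```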
Per-Window Telemetry¶
Each continuation window operates independently:
| Window | LLM (s) | Extract (ms) | Envelope (s) | Output Tokens | Reasoning Tokens | Facts | Saturation |
|---|---|---|---|---|---|---|---|
| 1 | 133.9 | 8 | 4.9 | 1,064 | 955 | 38 | 1.011 |
| 2 | 132.9 | 11 | 5.7 | 1,088 | 921 | 43 | 0.939 |
| 3 | 131.9 | 12 | 5.0 | 1,104 | 901 | 36 | 0.997 |
| 4 | 132.2 | 8 | 6.4 | 826 | 1,184 | 33 | 0.961 |
| 5 | 133.3 | 9 | 7.7 | 1,181 | 827 | 44 | 1.008 |
| 6 | 136.8 | 8 | 9.3 | 637 | 1,314 | 26 | 1.021 |
| 7 | 165.6 | 6 | 7.5 | 869 | 1,141 | 32 | 1.014 |
| 8 | 127.7 | 8 | 9.6 | 932 | 1,075 | 33 | 1.007 |
| 9 | 134.8 | 8 | 10.2 | 1,010 | 998 | 43 | 0.992 |
Key observations:
- LLM time is stable (~133 s/window) — CRP does not introduce growing cost
- Extraction is effectively free — 6–12 ms per window (< 0.01% of window time)
- Envelope build grows linearly — more facts → slightly longer ranking
Envelope Saturation¶
$$\text{saturation} = \frac{\text{envelope tokens packed}}{\text{envelope budget}}$$
Observed range: 0.939 – 1.021 (mean: 0.994)
Near 1.0 means the budget formula is well-calibrated. Each window packs 9–15 compressed facts from a pool of 26–44 extracted per window.
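Recomputing the range and mean from the per-window saturation column above confirms the stated figures:

```python
# Saturation per window = envelope tokens packed / envelope budget.
saturations = [1.011, 0.939, 0.997, 0.961, 1.008, 1.021, 1.014, 1.007, 0.992]

mean = sum(saturations) / len(saturations)
print(min(saturations), max(saturations))  # 0.939 1.021
print(round(mean, 3))                      # 0.994
```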
Thinking-Model Tax¶
The qwen3-4b model spends ~51% of its token budget on internal `<think>` reasoning that never appears in the output:
| Token Type | Count | % of Generation |
|---|---|---|
| Output (visible) | 9,836 | 49.1% |
| Reasoning (`<think>`) | 10,219 | 50.9% |
A non-thinking model would produce ~2x more content per window, needing ~4–5 windows instead of 9.
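The ~51% figure follows directly from the token totals in the table:

```python
# Split of generated tokens between visible output and hidden reasoning.
output_tokens, reasoning_tokens = 9_836, 10_219
total = output_tokens + reasoning_tokens

reasoning_share = 100 * reasoning_tokens / total
print(round(reasoning_share))  # 51 — about half the budget goes to reasoning
```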
Throughput Parity¶
| Metric | Direct LLM | CRP |
|---|---|---|
| Time | 119.7 s | 1,429.1 s |
| Content | 592 words | 6,993 words |
| Throughput | 4.9 words/s | 4.9 words/s |
| Task completed? | No | Yes |
CRP takes 12x longer but produces 12x more content — effective throughput is identical. The difference: CRP finishes the task.
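The parity claim checks out arithmetically from the table's raw numbers:

```python
# Effective throughput: direct single call vs. CRP-orchestrated run.
direct_words, direct_s = 592, 119.7
crp_words, crp_s = 6_993, 1_429.1

print(round(direct_words / direct_s, 1))   # 4.9 words/s (direct)
print(round(crp_words / crp_s, 1))         # 4.9 words/s (CRP)
print(round(crp_s / direct_s, 1))          # 11.9x longer wall-clock
print(round(crp_words / direct_words, 1))  # 11.8x more content
```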
Performance Regression Tests¶
test_benchmarks.py contains 12 tests with specific performance targets, including:
| Benchmark | Target |
|---|---|
| Cold session init | < 200 ms |
| Warm session init | < 50 ms |
| Dispatch overhead | < 100 ms |
| Envelope assembly | < 50 ms |
| Ingest throughput | > 100 facts/s |
| Cache hit | < 1 ms |
| Event emission | < 5 ms |
| Metrics export | < 10 ms |
Run them:
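Assuming the suite lives under a `tests/` directory and pytest is installed (both are assumptions about the repository layout, not confirmed here):

```shell
pytest tests/test_benchmarks.py -v
```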
Additional Benchmarks¶
| Benchmark | What It Shows |
|---|---|
| Streaming latency | Time-to-first-token for dispatch_stream() |
| Session persistence | Knowledge accumulation across dispatch() calls |
| Multi-agent pipeline | 3 agents sharing one CRP session |
| Security hardening | Injection detection, integrity verification |
| Model scaling | Same task across 1B, 4B, 7B, 13B models |
| Non-thinking vs thinking | Expected 2x throughput gain without `<think>` |
| Token budget sweep | 512/1024/2048/4096 tokens, plot windows vs content |
See Reproduce Benchmarks for how to run these yourself.