Reproduce Benchmarks¶
Step-by-step guide to reproduce CRP's benchmark results on your own hardware.
Prerequisites¶
- Python 3.10+
- An OpenAI-compatible LLM endpoint
- CRP installed:
pip install crprotocol
Option 1: LM Studio (Recommended for Local)¶
Step 1: Install LM Studio¶
Download from lmstudio.ai and install.
Step 2: Download a Model¶
In the Discover tab, search for and download one of:
| Model | Size | VRAM | Good For |
|---|---|---|---|
| qwen3-4b | 2.5 GB | 4 GB | Exact reproduction of published results |
| llama-3.1-8b | 4.9 GB | 6 GB | Non-thinking model, ~2x more output/window |
| qwen3-8b | 4.9 GB | 6 GB | Thinking model, higher quality |
Step 3: Configure Context Length¶
In the My Models tab, select the model and set:
- Context Length (n_ctx): 4096
- GPU Offload: Maximum layers your GPU supports
Thinking models
If using a thinking model (qwen3), expect ~50% of each window's tokens
to be internal <think> reasoning. This is normal — see
Benchmarks.
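To see how much of a window a thinking model spends on reasoning, you can measure the `<think>` blocks in the raw output. A minimal sketch (the helper names here are illustrative, not part of CRP):

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def thinking_fraction(text: str) -> float:
    """Fraction of characters spent inside <think>...</think> blocks."""
    if not text:
        return 0.0
    thought = sum(len(m.group(0)) for m in THINK_RE.finditer(text))
    return thought / len(text)

def strip_thinking(text: str) -> str:
    """Remove reasoning blocks, keeping only deliverable content."""
    return THINK_RE.sub("", text).strip()
```

A character-level fraction is only a proxy for the token-level "thinking tax", but it is close enough to confirm the ~50% figure above.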
Step 4: Start the Server¶
Click Start Server in the Developer tab. Default: http://localhost:1234.
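Before running the benchmark, you can confirm the server is reachable. A small stdlib-only sketch that queries the standard OpenAI-compatible `/models` route (the helper names are illustrative, not part of CRP):

```python
import json
import urllib.request

def models_url(base_url: str) -> str:
    """Normalize a base URL (with or without trailing slash) to its /models endpoint."""
    return base_url.rstrip("/") + "/models"

def list_models(base_url: str, api_key: str) -> list:
    """Return the model ids advertised by an OpenAI-compatible server."""
    req = urllib.request.Request(
        models_url(base_url),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return [m["id"] for m in json.load(resp).get("data", [])]
```

If `list_models("http://localhost:1234/v1", "lm-studio")` returns a list containing your model, the benchmark should be able to connect.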
Step 5: Run the Benchmark¶
git clone https://github.com/Constantinos-uni/context-relay-protocol.git
cd context-relay-protocol
pip install -e ".[dev]"
python examples/benchmark_continuation.py \
--base-url http://localhost:1234/v1 \
--model qwen3-4b \
--api-key lm-studio \
--max-tokens 2048 \
--max-continuations 10 \
--context-size 4096 \
--sections 30
Option 2: Ollama¶
# Pull a model (install Ollama first from ollama.com)
ollama pull llama3.1
# Run the benchmark
python examples/benchmark_continuation.py \
--base-url http://localhost:11434/v1 \
--model llama3.1 \
--api-key ollama \
--max-tokens 2048 \
--max-continuations 10
Option 3: Cloud API¶
# OpenAI
python examples/benchmark_continuation.py \
--base-url https://api.openai.com/v1 \
--model gpt-4o-mini \
--api-key $OPENAI_API_KEY \
--max-tokens 2048 \
--max-continuations 10
# Anthropic (via OpenAI-compatible proxy)
python examples/benchmark_continuation.py \
--base-url https://api.anthropic.com/v1 \
--model claude-sonnet-4-20250514 \
--api-key $ANTHROPIC_API_KEY \
--max-tokens 2048
Command-Line Options¶
| Flag | Default | Description |
|---|---|---|
| `--base-url` | `http://localhost:1234/v1` | LLM API endpoint |
| `--model` | `qwen3-4b` | Model name |
| `--api-key` | `lm-studio` | API key |
| `--max-tokens` | `2048` | Max tokens per window |
| `--max-continuations` | `10` | Max continuation windows |
| `--context-size` | `4096` | Model context size |
| `--sections` | `30` | Number of sections to request |
What to Expect¶
Timing¶
| Setup | Approx. Time | Windows |
|---|---|---|
| 4B thinking model, 4K ctx | ~25 min | 8–10 |
| 7–8B non-thinking, 4K ctx | ~15 min | 4–6 |
| 7–8B thinking, 8K ctx | ~20 min | 6–8 |
| Cloud API (GPT-4o-mini) | ~3 min | 4–6 |
Output Files¶
The benchmark produces:
| File | Format | Contents |
|---|---|---|
| `_crp_benchmark_report.txt` | Text | Full metrics, telemetry, complete outputs |
| `_crp_benchmark_report.json` | JSON | Structured telemetry for analysis |
Good Results¶
- Content multiplier ≥ 5x — CRP is delivering value
- Overhead < 10% — protocol cost acceptable
- Envelope saturation 0.85–1.05 — budget formula tuned
- Quality tier A or S — output meets standards
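The four thresholds above are easy to check programmatically once you have pulled the numbers from the report. A sketch (the function and parameter names are illustrative, not part of CRP):

```python
def healthy_run(multiplier: float, overhead_pct: float,
                saturation: float, tier: str) -> bool:
    """Apply the 'good results' thresholds:
    content multiplier >= 5x, overhead < 10%,
    envelope saturation in [0.85, 1.05], quality tier A or S."""
    return (
        multiplier >= 5.0
        and overhead_pct < 10.0
        and 0.85 <= saturation <= 1.05
        and tier in {"A", "S"}
    )
```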
Investigate If¶
- Overhead > 15% — extraction may be expensive; try disabling Stage 3/4
- Saturation < 0.5 — the context-size override is wrong; verify with adapter._context_size
- Multiplier < 3x — the task fits in one window; increase --sections or decrease --max-tokens
- Repeated sections — the thinking model is restarting; this is model behavior
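Repeated sections are easy to spot mechanically. A heuristic sketch that flags duplicate markdown headings in the benchmark output (assumes `#`-style headings; not part of CRP):

```python
from collections import Counter

def repeated_headings(text: str) -> list:
    """Headings that occur more than once, a common symptom of a
    thinking model restarting mid-task."""
    heads = [line.strip() for line in text.splitlines()
             if line.lstrip().startswith("#")]
    return [h for h, n in Counter(heads).items() if n > 1]
```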
Comparing Models¶
Run the same benchmark with different models to see how CRP adapts:
# Small thinking model
python examples/benchmark_continuation.py --model qwen3-4b --max-tokens 2048
# Medium non-thinking model
python examples/benchmark_continuation.py --model llama3.1 --max-tokens 2048
# Large model with more context
python examples/benchmark_continuation.py --model qwen3-8b --context-size 8192
Key differences you'll observe:
| Factor | Small Thinking | Medium Non-Thinking | Large |
|---|---|---|---|
| Windows needed | 8–10 | 4–6 | 3–5 |
| Thinking tax | ~50% | 0% | Varies |
| Content quality | Good | Good | Better |
| Section coverage | 80–85% | 85–90% | 90–95% |
| Time | Longest | Medium | Shortest |
CRP adapts automatically — no configuration changes needed between models.
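The window counts in the table follow from simple arithmetic: each window yields at most `--max-tokens` of output, minus whatever fraction goes to thinking. A back-of-envelope sketch (illustrative only; CRP's actual budgeting is more involved):

```python
import math

def estimate_windows(total_output_tokens: int, max_tokens: int,
                     thinking_tax: float = 0.0) -> int:
    """Rough estimate of continuation windows needed for a task:
    usable output per window shrinks by the thinking fraction."""
    usable = max_tokens * (1.0 - thinking_tax)
    return math.ceil(total_output_tokens / usable)
```

For a ~10K-token task at `--max-tokens 2048`, a non-thinking model needs about 5 windows while a thinking model at ~50% tax needs about 10, which matches the roughly 2x gap between the columns above.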