# Local Models & LM Studio
CRP works with any local LLM — no API keys, no cloud, no cost. This guide covers setting up local models with LM Studio, Ollama, llama.cpp, and vLLM.
## LM Studio (Recommended for Beginners)
LM Studio provides a GUI for downloading and running local models. CRP's own benchmarks were run on LM Studio.
### Step 1: Install LM Studio
Download from lmstudio.ai for Windows, macOS, or Linux.
### Step 2: Download a Model
- Open LM Studio
- Go to the Discover tab
- Search for a model. Recommended starting points:
| Model | Size | VRAM Needed | Best For |
|---|---|---|---|
| `qwen3-4b` | ~2.5 GB | 4 GB | Testing, low-resource machines |
| `llama-3.1-8b-instruct` | ~4.5 GB | 6 GB | General use, good quality |
| `mistral-7b-instruct-v0.3` | ~4 GB | 6 GB | Fast inference, good quality |
| `qwen3-14b` | ~8 GB | 10 GB | Higher quality, needs decent GPU |
| `llama-3.1-70b-instruct` | ~40 GB | 48+ GB | Best quality, needs high-end GPU |
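The download sizes above roughly track 4-bit quantization. A back-of-envelope sketch (the ~0.57 bytes-per-weight figure is an assumption for typical Q4-quantized GGUF files, not a CRP constant):

```python
def approx_q4_size_gb(params_billion: float, bytes_per_weight: float = 0.57) -> float:
    """Rough file size of a 4-bit-quantized model (assumed ~0.57 bytes/weight)."""
    return round(params_billion * bytes_per_weight, 1)

print(approx_q4_size_gb(8))   # close to the table's ~4.5 GB for llama-3.1-8b
print(approx_q4_size_gb(70))  # close to the table's ~40 GB for llama-3.1-70b
```

Add roughly 1-2 GB of VRAM headroom on top of the file size for the KV cache, which is where the table's "VRAM Needed" column comes from.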
- Click Download on your chosen model
### Step 3: Start the Local Server
- Go to the Developer tab (or Local Server in older versions)
- Select your downloaded model
- Configure context length:
    - Set Context Length (`n_ctx`) — this is the model's context window
    - Start with `4096` for testing, increase to `8192` or `32768` for production
- Click Start Server
- The server starts on `http://localhost:1234/v1` (OpenAI-compatible API)
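Before wiring up CRP, you can confirm the server is reachable (assuming the default port) with a quick request to the OpenAI-compatible models endpoint:

```bash
# Lists the loaded model(s); a connection error means the server isn't running
curl http://localhost:1234/v1/models
```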
Context length matters
CRP's continuation engine shines when the context window is small relative
to the task. With n_ctx=4096 and a 30-section document request, CRP will
chain 8-10 windows and produce 10-12x more content than a single call.
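The arithmetic behind that note can be sketched with illustrative numbers (the per-section and per-window token figures are assumptions for illustration, not measured CRP constants):

```python
import math

def windows_needed(target_tokens: int, visible_per_window: int) -> int:
    """Continuation windows needed to reach a target output length."""
    return math.ceil(target_tokens / visible_per_window)

# ~30 sections at ~300 tokens each, with ~1000 visible output tokens per window:
print(windows_needed(30 * 300, 1000))  # -> 9, in the 8-10 range quoted above
```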
### Step 4: Connect CRP to LM Studio
```python
from crp import Client
from crp.providers import OpenAIAdapter

client = Client(
    provider=OpenAIAdapter(
        model="qwen3-4b",                     # Must match LM Studio model name
        base_url="http://localhost:1234/v1",  # LM Studio default
        api_key="lm-studio",                  # Any non-empty string works
    ),
)
```
### Step 5: Run a Dispatch
```python
output, report = client.dispatch(
    system_prompt="You are a technical writer.",
    task_input="Write a comprehensive guide to container orchestration.",
)

print(f"Words: {len(output.split())}")
print(f"Quality: {report.quality_tier}")
print(f"Windows: {report.continuation_windows}")
```
### Step 6: Run the Benchmark
```bash
python examples/benchmark_continuation.py \
  --base-url http://localhost:1234/v1 \
  --model qwen3-4b \
  --api-key lm-studio \
  --max-tokens 2048 \
  --context-size 4096
```
## LM Studio Tips
Thinking models
Models like qwen3-4b use internal <think> reasoning that consumes
~51% of the token budget invisibly. A non-thinking model (like
llama-3.1-8b-instruct) produces ~2x more visible output per window.
- GPU offloading: In LM Studio settings, set GPU layers to maximum for best speed. If you run out of VRAM, reduce by 5 layers at a time.
- Flash Attention: Enable if your GPU supports it (most modern NVIDIA GPUs).
- Batch size: Leave at default (512) unless you're running multiple sessions.
- Temperature: CRP passes temperature to the model. Default is usually fine.
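If you are inspecting raw output from a thinking model yourself, a minimal sketch for separating the reasoning from the visible answer (assuming the model wraps its reasoning in literal `<think>` tags):

```python
import re

def strip_think(text: str) -> str:
    """Drop <think>...</think> reasoning blocks, keeping only visible output."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

raw = "<think>outline the steps first...</think>Here is the visible answer."
print(strip_think(raw))  # -> "Here is the visible answer."
```

Comparing `len(raw)` with `len(strip_think(raw))` on a few samples is a quick way to see how much of your token budget the reasoning consumes.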
## Ollama
```python
import crp

# Auto-detects running Ollama on localhost:11434
client = crp.Client(model="llama3.1")

output, report = client.dispatch(
    system_prompt="You are a code reviewer.",
    task_input="Review this function for security issues: ...",
)
```
### Custom Ollama Server
```python
from crp import Client
from crp.providers import OllamaAdapter

client = Client(provider=OllamaAdapter(
    base_url="http://192.168.1.100:11434",  # Remote server
    model="codellama",
))
```
## llama.cpp
```python
from crp import Client
from crp.providers import LlamaCppAdapter

client = Client(provider=LlamaCppAdapter(
    server_url="http://localhost:8080"
))
```
## vLLM / TGI / Any OpenAI-Compatible Server
Any server exposing an OpenAI-compatible API works with CRP's `OpenAIAdapter`.
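For example, pointing the same adapter at a vLLM server (port 8000 is vLLM's default; the model name is a placeholder and must match whatever the server is actually serving):

```python
from crp import Client
from crp.providers import OpenAIAdapter

client = Client(
    provider=OpenAIAdapter(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder: match the served model
        base_url="http://localhost:8000/v1",       # vLLM's default port
        api_key="not-needed",                      # any non-empty string
    ),
)
```

For TGI or other servers, only `base_url` and `model` change; everything else in the adapter stays the same.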
## Custom Function (Any LLM)
For any LLM not covered above:
```python
from crp import Client
from crp.providers import CustomProvider

def my_llm(messages, **kwargs):
    # Call your LLM however you want.
    # Return (output_text, finish_reason).
    return ("response text", "stop")

client = Client(provider=CustomProvider(
    generate_fn=my_llm,
    count_tokens_fn=lambda text: len(text) // 4,  # rough heuristic: ~4 chars per token
    context_size=8192,
))
```
## Choosing a Model
| Use Case | Recommended Model | Why |
|---|---|---|
| First test / low resources | `qwen3-4b` | Small, fast, good enough to see CRP work |
| General development | `llama-3.1-8b-instruct` | Non-thinking model, good quality/speed ratio |
| Quality-sensitive tasks | `qwen3-14b` or `llama-3.1-70b` | Better instruction following |
| Code tasks | `codellama-13b` or `deepseek-coder-v2` | Code-optimized |
| Benchmarking CRP | `llama-3.1-8b-instruct` | Non-thinking = clean benchmark data (no `<think>` tax) |
## Notes
- CRP auto-detects the model's context window size from the provider
- All extraction runs locally — no data leaves your machine
- The embedding model (`all-MiniLM-L6-v2`, ~80 MB) downloads automatically on first use
- First dispatch takes ~10-15 seconds for model loading; subsequent calls are fast