Context Sources: input-side provenance
Available since CRP 2.1. Enforcement pipeline + ledger in CRP 2.2. Provider hooks, derived manifests, turn-level enforcement, and ledger hash-chain in CRP 2.3.
CRP has always tracked where the model's output came from. The
Decision Provenance Engine classifies every generated
claim as CONTEXT_GROUNDED, PARAMETRIC, MIXED, or UNCERTAIN.
That's half the picture. Regulators ask the mirror question:
"Where did the input come from? Which of these facts was pulled from a vector DB? Which was a tool response? Which was typed by the user? Which did the model already know?"
Context-source provenance answers that — as a first-class protocol primitive, not a downstream annotation.
Why this matters
- ISO/IEC 42001 §4.1–4.2 — Context of the organisation. An auditor will ask for your source inventory on Day One.
- EU AI Act Art. 10 — Data and data governance. You must document the data used to operate each high-risk system.
- GDPR Art. 30 — Records of processing activities. Every personal-data source must be recorded.
- NIST AI RMF MAP 4: Risks and benefits are mapped for all components of the AI system, including third-party software and data.
The primitive
Four types in crp.core.context_source:
SourceKind
A closed enumeration of 16 kinds: 14 upstream categories, plus PARAMETRIC and UNATTESTED.
| Kind | Meaning |
|---|---|
| USER_TURN | End-user's direct chat input |
| SYSTEM_PROMPT | Developer-authored system prompt |
| DEVELOPER_PROMPT | OpenAI-style role=developer message |
| RAG_RETRIEVAL | Generic RAG chunk |
| VECTOR_DB | Pinecone / Weaviate / pgvector / etc. |
| DATABASE | Relational / NoSQL read |
| KNOWLEDGE_GRAPH | Neo4j / RDF / structured graph |
| MCP_TOOL | Model Context Protocol server call |
| FUNCTION_CALL | Function-calling result |
| WEB_SEARCH | Live web search |
| FILE_UPLOAD | User-uploaded document |
| AGENT_MEMORY | Agent-framework memory store |
| CKF_RETRIEVAL | CRP Contextual Knowledge Fabric (internal) |
| WARM_STORE | CRP warm-store fact (internal) |
| PARAMETRIC | Model-internal knowledge |
| UNATTESTED | Detected but not declared (audit signal) |
Closed intentionally. New kinds land via RFC so auditors don't receive novel strings unannounced.
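To illustrate what "closed" means in practice, the sketch below models a three-member subset with Python's standard `enum` module. `ToySourceKind` and `parse_kind` are hypothetical names, not part of `crp`; the point is only that an unrecognised string is rejected at the boundary instead of flowing into an audit report.

```python
from enum import Enum


class ToySourceKind(str, Enum):
    """Hypothetical three-member subset of a closed source-kind enumeration."""
    USER_TURN = "USER_TURN"
    VECTOR_DB = "VECTOR_DB"
    UNATTESTED = "UNATTESTED"


def parse_kind(raw: str) -> ToySourceKind:
    """Return the matching member, or raise ValueError for an unknown string."""
    try:
        return ToySourceKind(raw)
    except ValueError:
        raise ValueError(f"unknown source kind: {raw!r}") from None


print(parse_kind("VECTOR_DB").name)
```

Because the set of members is fixed at class-definition time, adding a kind requires a code change, which is the property an RFC process relies on.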
ContextSource
Frozen dataclass. Attach to any Fact or message.
import time
from crp import ContextSource, SourceKind, SourceOrigin, TrustLevel

src = ContextSource(
    kind=SourceKind.VECTOR_DB,
    source_id="acme-hr-policies-vdb",
    origin=SourceOrigin.OBSERVED,
    trust_level=TrustLevel.TRUSTED,
    contains_pii=True,
    region="eu-west-1",
    retrieval_query="redundancy policy 2024",
    retrieved_at=time.time(),
)
origin has three values in 2.1: DECLARED (from a signed manifest),
OBSERVED (plumbed in by the caller who knows the true origin), or
HEURISTIC (inferred by the detective-mode parser). CRP 2.3 adds a
fourth value, DERIVED, used by derived manifests.
ContextManifest
Customer-authored declaration of intended sources, HMAC-SHA256 signed.
import os
from crp import ContextManifest, ContextSource, SourceKind

manifest = ContextManifest(system_id="resume-rank-v1", customer_id="acme")
manifest.add(ContextSource(
    kind=SourceKind.VECTOR_DB,
    source_id="acme-hr-policies-vdb",
    contains_pii=True,
    region="eu-west-1",
))
manifest.add(ContextSource(
    kind=SourceKind.DATABASE,
    source_id="acme-postgres-applicants",
    contains_pii=True,
    region="eu-west-1",
))
manifest.sign(secret=os.environ["CRP_MANIFEST_SECRET"].encode())
blob = manifest.to_json()  # persist / ship to proxy
restored = ContextManifest.from_json(blob)
assert restored.verify(os.environ["CRP_MANIFEST_SECRET"].encode())
Signature verification uses hmac.compare_digest (constant-time).
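For readers unfamiliar with the pattern, here is a minimal standard-library sketch of HMAC-SHA256 signing with a constant-time check. It is not the `ContextManifest` implementation: the canonical-JSON encoding and the `sign_payload` / `verify_payload` helpers are assumptions made for illustration.

```python
import hashlib
import hmac
import json


def sign_payload(payload: dict, secret: bytes) -> str:
    """HMAC-SHA256 over a canonical (sorted-key, compact) JSON encoding."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(secret, canonical, hashlib.sha256).hexdigest()


def verify_payload(payload: dict, signature: str, secret: bytes) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign_payload(payload, secret)
    return hmac.compare_digest(expected, signature)


secret = b"demo-secret-for-illustration-only"
payload = {"system_id": "resume-rank-v1", "sources": ["acme-hr-policies-vdb"]}
sig = sign_payload(payload, secret)

# Any change to the signed content invalidates the signature.
tampered = {**payload, "sources": ["some-other-vdb"]}
print(verify_payload(payload, sig, secret), verify_payload(tampered, sig, secret))
```

Serialising with sorted keys matters: two dicts with the same content must produce the same bytes, or a round trip through JSON could break verification.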
check_attestation
Compares observed sources against the manifest and returns a list of
AttestationMismatch rows.
from crp import check_attestation

# observed_sources, manifest, and audit_log come from the surrounding application.
mismatches = check_attestation(observed_sources, manifest)
for m in mismatches:
    audit_log.emit(m.to_audit_event())  # CONTEXT_ATTESTATION_MISMATCH
Reasons reported:
- no_manifest: observed source seen but no manifest registered
- manifest_expired: manifest past expires_at
- unattested_kind: observed kind not in declared kinds
- unattested_source_id: kind declared but source_id mismatch
Benign kinds (USER_TURN, SYSTEM_PROMPT, DEVELOPER_PROMPT,
PARAMETRIC) are exempt from "no manifest" mismatches.
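The four reasons can be pictured as a short decision ladder. The sketch below uses plain `(kind, source_id)` tuples and a hypothetical `toy_check` function; it is not `check_attestation`. The order of the checks is an assumption, and since the text above only specifies the benign-kind exemption for the no-manifest case, the sketch applies it only there.

```python
from typing import Optional

BENIGN_KINDS = {"USER_TURN", "SYSTEM_PROMPT", "DEVELOPER_PROMPT", "PARAMETRIC"}


def toy_check(observed: list, declared: Optional[set], expired: bool = False) -> list:
    """Return (kind, source_id, reason) rows for observed sources that fail.

    observed: list of (kind, source_id) pairs
    declared: set of (kind, source_id) pairs, or None when no manifest exists
    """
    rows = []
    for kind, source_id in observed:
        if declared is None:
            if kind not in BENIGN_KINDS:  # benign kinds need no manifest
                rows.append((kind, source_id, "no_manifest"))
        elif expired:
            rows.append((kind, source_id, "manifest_expired"))
        elif kind not in {k for k, _ in declared}:
            rows.append((kind, source_id, "unattested_kind"))
        elif (kind, source_id) not in declared:
            rows.append((kind, source_id, "unattested_source_id"))
    return rows


declared = {("VECTOR_DB", "acme-hr-policies-vdb")}
observed = [
    ("VECTOR_DB", "acme-hr-policies-vdb"),  # declared: no row
    ("VECTOR_DB", "shadow-vdb"),            # kind declared, id is not
    ("WEB_SEARCH", "search-api"),           # kind never declared
]
for row in toy_check(observed, declared):
    print(row)
```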
Detective mode: heuristic parsing
When upstream code doesn't attach a ContextSource,
detect_source_kind(content, role=…) classifies the message using a
conservative pattern library plus the OpenAI-style role hint.
from crp import detect_source_kind
detect_source_kind("You are helpful.", role="system").kind
# → SourceKind.SYSTEM_PROMPT
detect_source_kind("<RAG>chunk 1</RAG>", role="user").kind
# → SourceKind.RAG_RETRIEVAL
detect_source_kind("SELECT name FROM users").kind
# → SourceKind.DATABASE
Detective-mode results always have origin=HEURISTIC and
trust_level=UNKNOWN. They surface in audit reports for review — they
are never treated as authoritative.
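A rough picture of how such a classifier can work, assuming content patterns take priority over the role hint. The two regexes, the role table, and the `toy_detect` function are hypothetical stand-ins, not the pattern library shipped with CRP; falling back to UNATTESTED when nothing matches is likewise an assumption.

```python
import re
from typing import Optional

# Hypothetical pattern library, checked in order; the first match wins.
_PATTERNS = [
    (re.compile(r"<RAG>.*?</RAG>", re.IGNORECASE | re.DOTALL), "RAG_RETRIEVAL"),
    (re.compile(r"^\s*SELECT\s.+\sFROM\s", re.IGNORECASE | re.DOTALL), "DATABASE"),
]
_ROLE_HINTS = {
    "system": "SYSTEM_PROMPT",
    "developer": "DEVELOPER_PROMPT",
    "user": "USER_TURN",
}


def toy_detect(content: str, role: Optional[str] = None) -> dict:
    """Guess a source kind from content patterns, then from the role hint."""
    for pattern, candidate in _PATTERNS:
        if pattern.search(content):
            kind = candidate
            break
    else:
        # No content pattern matched: fall back to the role, if recognised.
        kind = _ROLE_HINTS.get(role, "UNATTESTED")
    # Heuristic results are never upgraded beyond HEURISTIC / UNKNOWN.
    return {"kind": kind, "origin": "HEURISTIC", "trust_level": "UNKNOWN"}


print(toy_detect("<RAG>chunk 1</RAG>", role="user")["kind"])
```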
Fact.source
The extraction pipeline's Fact gains an optional source field.
from crp import Fact, ContextSource, SourceKind

fact = Fact(
    text="The redundancy policy requires 60 days' notice.",
    source=ContextSource(
        kind=SourceKind.VECTOR_DB,
        source_id="acme-hr-policies-vdb",
        contains_pii=False,
    ),
)
The field defaults to None, so every v2.0 caller continues to work
without modification.
Envelope integration
[CONTEXT_SOURCES] is now a recognised Tier-3 section in the envelope
formatter. When the envelope builder packs facts that carry source
records, the source manifest can be rendered as a dedicated section for
the model to reason about — or stripped for production prompts and
retained only in the audit log.
See Context Envelope for the 9-section layout.
Error codes
| Code | Name | Meaning |
|---|---|---|
| 1040 | CONTEXT_ATTESTATION_MISMATCH | Observed source not declared in manifest |
| 1041 | CONTEXT_MANIFEST_INVALID | Manifest failed to parse or verify |
Complete example
import os
import time
from crp import (
    Client, Fact, ContextSource, ContextManifest,
    SourceKind, SourceOrigin, TrustLevel,
    detect_source_kind, check_attestation,
)

# my_vector_db, some_tool_response, and audit_log are supplied by your application.

# 1. Declare your sources up front.
manifest = ContextManifest(system_id="triage-bot", customer_id="acme")
manifest.add(ContextSource(
    kind=SourceKind.VECTOR_DB, source_id="policy-vdb",
    contains_pii=False, region="eu-west-1",
    trust_level=TrustLevel.TRUSTED,
))
manifest.sign(os.environ["CRP_MANIFEST_SECRET"].encode())

# 2. Tag facts as they are retrieved.
rag_chunk = my_vector_db.query("redundancy policy")
fact = Fact(
    text=rag_chunk.text,
    source=ContextSource(
        kind=SourceKind.VECTOR_DB,
        source_id="policy-vdb",
        origin=SourceOrigin.OBSERVED,
        retrieval_query="redundancy policy",
        retrieved_at=time.time(),
    ),
)

# 3. Heuristically detect anything that slipped through unattested.
detected = detect_source_kind(some_tool_response, role="tool")

# 4. Audit the session.
observed = [fact.source, detected]
for mismatch in check_attestation(observed, manifest):
    audit_log.emit(mismatch.to_audit_event())
Enforcement pipeline (CRP 2.2)
2.1 defined the vocabulary. 2.2 defines the wire-side choke-point every envelope assembly must flow through when a manifest is attached.
from crp import (
    ContextEnforcer,
    EnforcementPolicy,
    InMemoryAuditSink,
    RotatingKeyProvider,
    ManifestLedger,
    set_default_enforcer,
)

# SECRET holds your HMAC key bytes; load it from your own secret store.
enforcer = ContextEnforcer(
    policy=EnforcementPolicy.REJECT,                   # OBSERVE | WARN | REJECT
    sink=InMemoryAuditSink(),                          # or LoggingAuditSink, or your own
    key_provider=RotatingKeyProvider(initial=SECRET),  # HMAC keys w/ grace window
    ledger=ManifestLedger(),                           # append-only JSONL per session
    session_id="sess-abc",
    require_signed_manifest=True,
)
set_default_enforcer(enforcer)  # process-wide default; opt-in
Every call to assemble_messages(manifest=…, observed_sources=…)
now runs the pipeline before any message is constructed:
- Manifest signature verify (via KeyProvider.candidates(), rotation aware)
- Manifest expiry check
- Attestation mismatch scan (observed sources vs. declared)
- Injection-signal scan on content from TRUSTED sources
- Auto-record to the ledger on success
Under REJECT, violations raise CRPError(CONTEXT_ATTESTATION_MISMATCH)
or CRPError(CONTEXT_MANIFEST_INVALID) with the offending source_ids
named in the message. Under OBSERVE / WARN, audit events are emitted
to the sink and assembly continues.
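The difference between the three policies comes down to what happens after a violation has been found. A minimal sketch, assuming violations are always recorded before any exception is raised (the text above does not say whether REJECT also writes to the sink); `ToyPolicy`, `ToyEnforcementError`, and `apply_policy` are hypothetical names, not CRP classes.

```python
from enum import Enum


class ToyPolicy(Enum):
    OBSERVE = "observe"
    WARN = "warn"
    REJECT = "reject"


class ToyEnforcementError(Exception):
    """Stand-in for the CRPError raised under REJECT."""


def apply_policy(violations: list, policy: ToyPolicy, sink: list) -> bool:
    """Record every violation in the sink, then raise only under REJECT.

    Returns True when message assembly may continue.
    """
    for source_id in violations:
        sink.append({"policy": policy.value, "source_id": source_id})
    if violations and policy is ToyPolicy.REJECT:
        # Name the offending source_ids in the error message.
        raise ToyEnforcementError("unattested sources: " + ", ".join(violations))
    return True


sink = []
print(apply_policy(["shadow-vdb"], ToyPolicy.OBSERVE, sink), len(sink))
```

Starting a deployment in OBSERVE and tightening to REJECT once the audit sink stays quiet is one way to roll this out without breaking existing traffic.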
Injection-signal detection (CRP 2.2)
Trust labels are declarative. Marking a SYSTEM_PROMPT as TRUSTED
doesn't make its content safe if the prompt was templated from user
input. CRP 2.2 scans TRUSTED sources for six high-precision patterns:
| ID | Severity | Example |
|---|---|---|
| instruction_override | high | "ignore all previous instructions" |
| role_jailbreak | high | "you are now DAN / developer mode / jailbroken" |
| exfil_secret | high | "reveal your system prompt / api keys" |
| delimiter_forgery | medium | [END VERIFIED CONTEXT], <\|im_start\|> |
| payload_url | medium | data:, javascript:, file: URIs |
| embedded_tool_call | medium | <tool_call>, <function_call> |
Hits are surfaced as CONTEXT_TRUST_VIOLATION audit events. Untrusted
sources are not scanned by default (untrusted content is expected to
contain arbitrary text — scanning it produces noise).
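A simplified sketch of this kind of scan, covering three of the six signal IDs. The regexes are illustrative guesses and far cruder than a production pattern set would need to be; `scan_trusted` is a hypothetical helper, not a CRP function.

```python
import re

# Hypothetical regexes for three of the six signal IDs in the table above.
_SIGNALS = [
    ("instruction_override", "high",
     re.compile(r"ignore\s+(?:all\s+)?previous\s+instructions", re.IGNORECASE)),
    ("payload_url", "medium",
     re.compile(r"\b(?:data|javascript|file):", re.IGNORECASE)),
    ("embedded_tool_call", "medium",
     re.compile(r"<(?:tool_call|function_call)>", re.IGNORECASE)),
]


def scan_trusted(content: str, trust_level: str) -> list:
    """Return (signal_id, severity) hits; content that is not TRUSTED is skipped."""
    if trust_level != "TRUSTED":
        return []
    return [(sid, sev) for sid, sev, rx in _SIGNALS if rx.search(content)]


# A system prompt templated from user input can carry an override attempt.
prompt = "You are a helpful bot. Please ignore all previous instructions."
print(scan_trusted(prompt, "TRUSTED"))
```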
Manifest ledger (CRP 2.2)
Cross-turn continuity. Every verified manifest is appended to
crp_sessions/<session_id>.manifest.jsonl:
from crp import ManifestLedger, SourceKind

ledger = ManifestLedger()
# Find every turn where a source was declared:
entries = ledger.find_by_source_id("acme-hr-policies-vdb", session_id="sess-abc")
# Find every session that touched a particular kind of source:
entries = ledger.find_by_kind(SourceKind.WEB_SEARCH, session_id="sess-abc")
# Periodic integrity audit (current_key is your active HMAC key):
bad = ledger.verify_signatures("sess-abc", secret=current_key)
# Discover every session on disk (for cold-start audits):
for sid in ledger.scan_sessions():
    ...
Session IDs are sanitized to [A-Za-z0-9_-] — identifiers that sanitize
to empty are rejected (prevents directory traversal).
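To show why this blocks traversal, the sketch below strips every character outside the allowed set and refuses an empty result. Whether CRP strips or substitutes disallowed characters is not stated above, so stripping is an assumption, and `toy_ledger_path` is a hypothetical helper.

```python
import re
from pathlib import Path


def toy_ledger_path(session_id: str, root: str = "crp_sessions") -> Path:
    """Strip every character outside [A-Za-z0-9_-]; reject an empty result."""
    safe = re.sub(r"[^A-Za-z0-9_-]", "", session_id)
    if not safe:
        raise ValueError("session_id is empty after sanitisation")
    return Path(root) / f"{safe}.manifest.jsonl"


print(toy_ledger_path("sess-abc"))
# Dots and slashes are removed, so the path cannot leave the root directory.
print(toy_ledger_path("../../etc/passwd"))
```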
Key management (CRP 2.2)
Two KeyProvider implementations; integrators plug in their own for
KMS / Vault / HSM:
from crp import EnvVarKeyProvider, RotatingKeyProvider
# Minimum viable: env var (hex auto-detect, ≥32 byte minimum)
kp = EnvVarKeyProvider("CRP_MANIFEST_SECRET")
# Rotation with grace window
kp = RotatingKeyProvider(initial=current_key)
kp.rotate(new_key) # old key still verifies in-flight manifests
kp.retire_all() # drop every retired key
The enforcer calls kp.verify(manifest) which tries candidates() in
order (current first, retired next). hmac.compare_digest is used for
each comparison — no timing side channel between candidates.
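The rotation behaviour can be sketched with a small key ring. `ToyRotatingKeys` is a hypothetical class, not `RotatingKeyProvider`; it signs raw bytes instead of manifests, and it checks every candidate without an early exit, which is one way (assumed here) to avoid revealing which key matched.

```python
import hashlib
import hmac


class ToyRotatingKeys:
    """Hypothetical key ring: one current key plus retired keys kept for a grace window."""

    def __init__(self, initial: bytes):
        self._current = initial
        self._retired = []

    def rotate(self, new_key: bytes) -> None:
        self._retired.insert(0, self._current)  # most recently retired first
        self._current = new_key

    def retire_all(self) -> None:
        self._retired.clear()

    def candidates(self) -> list:
        return [self._current, *self._retired]

    def sign(self, message: bytes) -> str:
        return hmac.new(self._current, message, hashlib.sha256).hexdigest()

    def verify(self, message: bytes, signature: str) -> bool:
        ok = False
        for key in self.candidates():  # no early exit: every candidate is checked
            expected = hmac.new(key, message, hashlib.sha256).hexdigest()
            ok |= hmac.compare_digest(expected, signature)
        return ok


keys = ToyRotatingKeys(b"old-key-for-illustration")
in_flight = keys.sign(b"manifest-bytes")
keys.rotate(b"new-key-for-illustration")
print(keys.verify(b"manifest-bytes", in_flight))  # retired key still verifies
keys.retire_all()
print(keys.verify(b"manifest-bytes", in_flight))  # grace window closed
```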
Retrieval auto-stamping (CRP 2.2)
Two retrieval surfaces automatically stamp provenance on facts that
arrive without an explicit source:
- Warm store (WarmStateStore.get_active_facts_as_extraction): kind=WARM_STORE, trust_level=UNKNOWN.
- CKF (ContextualKnowledgeFabric.retrieve): kind=CKF_RETRIEVAL, trust_level=UNKNOWN, with retrieval modes and score in metadata.
Trust is deliberately UNKNOWN — CRP cannot infer upstream trust at
retrieval time. Integrators who have vetted their warm-store contents
should set fact.source explicitly before the fact enters the store.
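The stamping rule amounts to "fill in a source only when none is present". A small sketch with hypothetical `ToyFact` and `ToySource` dataclasses (simplified stand-ins for `Fact` and `ContextSource`, with only the fields needed here):

```python
from dataclasses import dataclass, field, replace
from typing import Optional


@dataclass(frozen=True)
class ToySource:
    kind: str
    trust_level: str = "UNKNOWN"
    metadata: dict = field(default_factory=dict)


@dataclass(frozen=True)
class ToyFact:
    text: str
    source: Optional[ToySource] = None  # optional, so older callers keep working


def stamp_if_missing(fact: ToyFact, kind: str, **metadata) -> ToyFact:
    """Attach a default UNKNOWN-trust source unless the fact already has one."""
    if fact.source is not None:
        return fact  # an explicit, vetted source is never overwritten
    return replace(fact, source=ToySource(kind=kind, metadata=metadata))


bare = ToyFact("The redundancy policy requires 60 days' notice.")
stamped = stamp_if_missing(bare, "CKF_RETRIEVAL", modes=["dense"], score=0.82)
print(stamped.source.kind, stamped.source.trust_level)
```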
Provider/framework auto-hooks (CRP 2.3)
Callers who hand-build OpenAI / Anthropic requests, or use
LangChain/LlamaIndex/Semantic Kernel directly, previously bypassed the
ContextEnforcer entirely. CRP 2.3 fixes that with transparent hooks:
from crp.integrations import install_openai_hook, install_anthropic_hook
from crp.integrations.langchain_hook import CRPCallbackHandler
# Monkey-patches openai.OpenAI.chat.completions.create (sync + async).
install_openai_hook(enforcer=my_enforcer)
# Same for anthropic.Anthropic.messages.create.
install_anthropic_hook(enforcer=my_enforcer)
# LangChain: add as a callback to any LLM/Chain/AgentExecutor.
chain.invoke(..., callbacks=[CRPCallbackHandler(enforcer=my_enforcer)])
Every bypass path now flows through the same ledger and the same policy.
Derived manifests (CRP 2.3)
Not every integrator will stamp sources. CRP 2.3 can derive a manifest from observed bytes, which separates content an integrator simply did not stamp from genuinely ephemeral content:
from crp.core.manifest_derive import derive_manifest_from_content

manifest = derive_manifest_from_content(
    content=retrieved_text,
    retrieval_query="What is policy X?",
    upstream_uri="vector://kb/doc-42",
)
Derived manifests carry origin=DERIVED and default TrustLevel.UNKNOWN
— they never silently upgrade trust. They do flow through the ledger so
OBSERVE-mode deployments get audit coverage even without explicit stamps.
Turn-level enforcement (CRP 2.3)
The enforcer now re-validates every turn, including tool_result
injections mid-conversation:
# Full message list (typical chat-completions shape)
enforcer.check_messages(messages, observed=sources)
# Single tool_result about to re-enter the prompt
enforcer.check_tool_result(tool_call_id, content, source=...)
This closes the gap left by checking only once, at assembly time.
Ledger hash chain + SIEM forwarding (CRP 2.3)
ManifestLedger entries now include prev_hash + entry_hash. Tampering
with any entry breaks the chain, which a verifier can detect by recomputing the hashes in order.
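As a generic illustration of how a prev_hash / entry_hash chain exposes tampering, the sketch below builds and verifies a chain with the standard library. It is not the ManifestLedger API: the hashing scheme (SHA-256 over the previous hash plus canonical JSON), the all-zero genesis value, and the helper names are assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed prev_hash for the first entry


def _digest(prev_hash: str, body: dict) -> str:
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + canonical).encode()).hexdigest()


def append_entry(chain: list, body: dict) -> None:
    """Link the new entry to the hash of the one before it."""
    prev_hash = chain[-1]["entry_hash"] if chain else GENESIS
    chain.append({"body": body, "prev_hash": prev_hash,
                  "entry_hash": _digest(prev_hash, body)})


def first_broken_index(chain: list):
    """Return the index of the first entry that fails verification, else None."""
    prev_hash = GENESIS
    for i, entry in enumerate(chain):
        recomputed = _digest(prev_hash, entry["body"])
        if entry["prev_hash"] != prev_hash or entry["entry_hash"] != recomputed:
            return i
        prev_hash = entry["entry_hash"]
    return None


chain = []
for turn in range(3):
    append_entry(chain, {"turn": turn, "source_id": "policy-vdb"})
print(first_broken_index(chain))   # intact chain

chain[1]["body"]["source_id"] = "shadow-vdb"  # edit an entry after the fact
print(first_broken_index(chain))   # index of the tampered entry
```

An attacker who rewrites an entry must also rewrite every later entry_hash, which is why replicating the ledger to an external sink makes tampering much harder to hide.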
SIEM / syslog / JSONL replicators plug in via the forward_to= kwarg:
from crp.core.ledger_backends import JsonlFileSink, SyslogSink

ledger = ManifestLedger(
    session_dir="crp_sessions",
    forward_to=[
        JsonlFileSink("/var/log/crp/ledger.jsonl"),
        SyslogSink(host="siem.internal", port=514),
    ],
)
Forwarding never blocks a ledger write; failing sinks are logged and skipped.
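The "log and skip" half of that guarantee can be sketched as a fan-out loop that isolates each sink's failure. This toy version models sinks as plain callables and runs them synchronously, so it shows only failure isolation, not non-blocking delivery; `forward_entry` is a hypothetical helper, not part of `crp`.

```python
import logging

logger = logging.getLogger("toy_ledger")


def forward_entry(entry: dict, sinks: list) -> int:
    """Send entry to every sink; log and skip any sink that raises.

    Returns the number of sinks that accepted the entry.
    """
    delivered = 0
    for sink in sinks:
        try:
            sink(entry)
            delivered += 1
        except Exception:
            # One failing sink must not stop the others or the ledger write.
            logger.exception("ledger sink %r failed; skipping", sink)
    return delivered


def broken_sink(entry: dict) -> None:
    raise ConnectionError("siem unreachable")


received = []
count = forward_entry({"session": "sess-abc"}, [broken_sink, received.append])
print(count, received)
```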
Default observe enforcer (CRP 2.3)
assemble_messages now auto-installs an OBSERVE-mode enforcer if the
caller has not configured one. Previously, forgetting set_default_enforcer
meant a silent no-op. Now you always get at least the ledger trail.
Custom / local endpoints (CRP 2.3)
OpenAIAdapter accepts empty or None api_key when base_url is set.
Works out-of-the-box with LM Studio, vLLM, llama.cpp server, Ollama
OpenAI-compat, TGI, and any unauthenticated local endpoint:
from crp.providers import OpenAIAdapter

adapter = OpenAIAdapter(
    model="gemma-3-270m-it-qat",
    api_key="",  # or None
    base_url="http://192.168.0.6:1234/v1",
)
Roadmap
- ~~Protocol-level [CONTEXT_SOURCES] envelope section renderer and budget accounting (2.1.x)~~ (shipped in 2.1)
- ~~Dispatch router auto-attaches ContextSource to tool-role messages (2.2.0)~~ (shipped in 2.2)
- ~~Provider/framework auto-hooks, derived manifests, turn-level enforcement, ledger hash-chain (2.3.0)~~ (shipped in 2.3)
- crp-comply consumes ContextSource + the manifest ledger to emit the Context Analysis deliverable (ISO 42001 §4.1–4.2) and GDPR Art. 30 RoPA. Coming in Comply 1.x.
See the CHANGELOG for the complete 2.3.0 note.