CRP-SPEC-009: Contextual Knowledge Fabric (CKF) Specification

Document: CRP-SPEC-009
Title: Context Relay Protocol (CRP) - Contextual Knowledge Fabric
Version: 3.0.0
Status: Draft
Author: Constantinos Vidiniotis, AutoCyber AI Pty Ltd
Contact: contact@crprotocol.io
Date: 2026-05-25
License: CC BY 4.0 (specification text)
Prerequisites: CRP-SPEC-001, CRP-SPEC-003

Abstract

This document specifies the Contextual Knowledge Fabric (CKF), CRP's persistent knowledge graph, which serves as Tier 3 (cold storage) in the four-tier memory hierarchy. The CKF stores facts as graph-embedded nodes with vector representations, connected by semantic edges and organised into communities via the Leiden algorithm. It is the long-term memory from which Context Envelopes are assembled, and the persistence layer that enables cross-session knowledge reuse, conditional dispatch via ETag caching, and knowledge staleness tracking.

1. Architecture Overview

┌─────────────────────────────────────────────────────────┐
│                 Contextual Knowledge Fabric              │
│                                                          │
│  ┌───────────┐   ┌───────────┐   ┌───────────────────┐ │
│  │ Fact Store │   │ HNSW Index│   │ Community Graph   │ │
│  │ (nodes)   │◄──│ (vectors) │──►│ (Leiden clusters) │ │
│  └───────────┘   └───────────┘   └───────────────────┘ │
│        │                                    │            │
│  ┌─────▼──────┐                   ┌────────▼────────┐  │
│  │ Source      │                   │ State Hash      │  │
│  │ Registry   │                   │ (ETag source)   │  │
│  └────────────┘                   └─────────────────┘  │
└─────────────────────────────────────────────────────────┘

2. Fact Node Schema

Each fact in the CKF is stored as a node with the following fields:

FactNode {
  fact_id:             string       // UUID - immutable after creation
  content:             string       // The fact text (10–2,048 tokens, see §2.1)
  content_hash:        string       // SHA-256 of content (change detection)
  embedding:           float[]      // Vector embedding (model-specific dimensionality)
  source_id:           string       // Reference to originating document in Source Registry
  source_location:     string       // Page number, paragraph, URL fragment
  importance_weight:   float        // 0.0–1.0, intrinsic to the fact
  community_label:     string       // Leiden community assignment
  ingested_at:         ISO 8601     // Timestamp of initial ingestion
  modified_at:         ISO 8601     // Timestamp of last modification
  access_count:        integer      // Number of times included in an envelope
  last_accessed_at:    ISO 8601     // Timestamp of most recent envelope inclusion
  ttl:                 ISO 8601 dur // Optional time-to-live (e.g., P90D)
  status:              enum         // ACTIVE | STALE | DELETED | QUARANTINED
  metadata:            map          // Arbitrary key-value pairs
}

2.1 Fact Size Constraints

Minimum content length: 10 tokens
Maximum content length: 2,048 tokens
Facts exceeding 2,048 tokens MUST be chunked during ingestion (see §4)

2.2 Fact Status Lifecycle

ACTIVE ──→ STALE (ttl expired, manual flag, or source document updated)
ACTIVE ──→ DELETED (source document removed, GDPR erasure request)
ACTIVE ──→ QUARANTINED (flagged by DPE as producing fabrication/distortion)
STALE  ──→ ACTIVE (re-ingested from updated source)
STALE  ──→ DELETED (manual cleanup)
DELETED ──→ (permanent removal from HNSW index after 30 days)

Facts with status STALE MAY still be retrieved during envelope construction but receive a freshness penalty in the ranking phase (CRP-SPEC-003 §6.4). Facts with status DELETED or QUARANTINED MUST NOT be retrieved.
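
The lifecycle above can be read as a small state machine. The following is a minimal sketch, not a normative implementation: the names `FactStatus`, `ALLOWED_TRANSITIONS`, `can_transition`, and `is_retrievable` are illustrative assumptions, and only the transitions listed above are modelled.

```python
from enum import Enum


class FactStatus(Enum):
    ACTIVE = "ACTIVE"
    STALE = "STALE"
    DELETED = "DELETED"
    QUARANTINED = "QUARANTINED"


# Only the transitions listed in the lifecycle above; anything else is rejected.
ALLOWED_TRANSITIONS = {
    FactStatus.ACTIVE: {FactStatus.STALE, FactStatus.DELETED, FactStatus.QUARANTINED},
    FactStatus.STALE: {FactStatus.ACTIVE, FactStatus.DELETED},
    FactStatus.DELETED: set(),
    # Manual release after operator review (§8.4) is not modelled in this sketch.
    FactStatus.QUARANTINED: set(),
}


def can_transition(current: FactStatus, target: FactStatus) -> bool:
    """Return True if the lifecycle lists a direct transition current -> target."""
    return target in ALLOWED_TRANSITIONS[current]


def is_retrievable(status: FactStatus) -> bool:
    """DELETED and QUARANTINED facts MUST NOT be retrieved; STALE facts MAY be."""
    return status in (FactStatus.ACTIVE, FactStatus.STALE)
```

A retrieval path would call `is_retrievable` before ranking, leaving the freshness penalty for STALE facts to the ranking phase.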

3. HNSW Vector Index

3.1 Purpose

The Hierarchical Navigable Small World (HNSW) index provides sub-linear approximate nearest-neighbour search over fact embeddings, enabling fast retrieval during envelope construction Phase 1 (CRP-SPEC-003 §5).

3.2 Configuration Parameters

Parameter	Default	Description
`M`	16	Max connections per node per layer
`ef_construction`	200	Size of dynamic candidate list during construction
`ef_search`	100	Size of dynamic candidate list during search
`distance_metric`	`cosine`	Distance function for similarity
`dimensions`	Model-dependent	Embedding vector dimensionality (e.g., 1536 for text-embedding-3-small)

3.3 Index Maintenance

New facts are added to the index immediately upon ingestion
Deleted facts are marked in a deletion set and excluded from search results; physical removal occurs during periodic index compaction
Modified facts (content_hash changed) trigger re-embedding and index update
Index is rebuilt fully if >20% of nodes have been modified/deleted since last rebuild
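
Two of these rules are simple enough to sketch directly: the soft-delete filter applied to search hits, and the 20% rebuild trigger. The function names and the way change counts are tracked are assumptions made for illustration; the specification does not prescribe them.

```python
REBUILD_THRESHOLD = 0.20  # full rebuild when more than 20% of nodes have changed


def needs_full_rebuild(total_nodes: int, changed_since_rebuild: int) -> bool:
    """True when modified plus deleted nodes exceed 20% of the index."""
    if total_nodes == 0:
        return False
    return changed_since_rebuild / total_nodes > REBUILD_THRESHOLD


def filter_deleted(candidates: list[str], deletion_set: set[str]) -> list[str]:
    """Drop soft-deleted fact IDs from raw search hits, preserving rank order."""
    return [fact_id for fact_id in candidates if fact_id not in deletion_set]
```

Because deleted facts are only filtered at query time until compaction, a deployment may want to over-fetch candidates so that the filtered result still contains enough hits.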

3.4 Multi-Model Embeddings

If the CRP deployment changes embedding models (e.g., upgrading from text-embedding-3-small to text-embedding-3-large):

- All existing facts MUST be re-embedded with the new model
- The HNSW index MUST be rebuilt
- During migration, dual indexes MAY run in parallel
- The CKF state hash (ETag) changes on embedding model change, so all client ETags are invalidated

4. Document Ingestion Pipeline

4.1 Ingestion Flow

Source Document → Chunking → Fact Extraction → Embedding → HNSW Insert → Community Update

4.2 Chunking Strategy

Documents are chunked using a semantic-aware chunking strategy:

Paragraph-level splitting: Split on paragraph boundaries first
Sentence-level refinement: If a paragraph exceeds 512 tokens, split at sentence boundaries
Overlap: 10% token overlap between adjacent chunks to preserve cross-boundary context
Metadata preservation: Each chunk retains: source_id, page number, section heading
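
The first three rules can be sketched as follows. This is an illustrative approximation only: whitespace-separated words stand in for the deployment's real tokenizer, sentence boundaries are detected with a simple regular expression, metadata preservation is omitted, and all function names are hypothetical.

```python
import re

MAX_PARAGRAPH_TOKENS = 512
OVERLAP_RATIO = 0.10


def count_tokens(text: str) -> int:
    # Assumption: whitespace tokens stand in for the real tokenizer.
    return len(text.split())


def split_paragraph(paragraph: str, max_tokens: int) -> list[str]:
    """Return the paragraph whole, or pack its sentences into pieces of at most max_tokens.

    A single sentence longer than max_tokens is kept as one oversized piece.
    """
    if count_tokens(paragraph) <= max_tokens:
        return [paragraph]
    sentences = re.split(r"(?<=[.!?])\s+", paragraph)
    pieces, current = [], []
    for sentence in sentences:
        if current and count_tokens(" ".join(current + [sentence])) > max_tokens:
            pieces.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        pieces.append(" ".join(current))
    return pieces


def chunk_document(text: str, max_tokens: int = MAX_PARAGRAPH_TOKENS) -> list[str]:
    """Split on blank lines, refine long paragraphs, then add 10% overlap."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    pieces = [piece for p in paragraphs for piece in split_paragraph(p, max_tokens)]
    chunks = []
    for i, piece in enumerate(pieces):
        if i > 0:
            # Prefix with the trailing 10% of the previous piece's tokens, if any.
            prev_tokens = pieces[i - 1].split()
            overlap = int(len(prev_tokens) * OVERLAP_RATIO)
            if overlap > 0:
                piece = " ".join(prev_tokens[-overlap:]) + " " + piece
        chunks.append(piece)
    return chunks
```

Taking the overlap from the previous piece's tail is one possible reading of "10% token overlap"; an implementation could equally share tokens in both directions.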

4.3 Fact Extraction

Each chunk becomes one or more facts:

- Simple chunks (single coherent assertion) → 1 fact
- Complex chunks (multiple assertions) → decomposed into N facts using the claim segmentation model (same as DPE Stage 1, CRP-SPEC-005 §3)

4.4 Importance Weight Assignment

Importance weight is assigned based on source metadata:

Source Type	Base Weight	Modifiers
Regulatory text (laws, standards)	0.90	+0.05 for specific articles
Official documentation	0.80	+0.05 for version-specific content
Peer-reviewed research	0.75	+0.10 for meta-analyses
Internal company documents	0.70	Varies by classification
Web content	0.50	+0.10 for authoritative domains
User-provided context	0.60	Configurable
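
As a rough illustration, the table can be applied as a lookup with an optional modifier. The source-type keys and the `modifier_applies` flag are hypothetical names chosen for this sketch; the "Varies by classification" and "Configurable" entries are left to the operator and carry no fixed modifier here.

```python
BASE_WEIGHTS = {
    "regulatory": 0.90,
    "official_docs": 0.80,
    "peer_reviewed": 0.75,
    "internal": 0.70,
    "user_context": 0.60,
    "web": 0.50,
}

# Only the fixed modifiers from the table above.
MODIFIERS = {
    "regulatory": 0.05,     # specific articles
    "official_docs": 0.05,  # version-specific content
    "peer_reviewed": 0.10,  # meta-analyses
    "web": 0.10,            # authoritative domains
}


def importance_weight(source_type: str, modifier_applies: bool = False) -> float:
    """Base weight plus the table's modifier, capped at the 1.0 upper bound."""
    weight = BASE_WEIGHTS[source_type]
    if modifier_applies:
        weight += MODIFIERS.get(source_type, 0.0)
    return round(min(weight, 1.0), 2)
```

For example, a specific article of a regulation would receive `importance_weight("regulatory", modifier_applies=True)`.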

4.5 Source Registry

Every ingested document is recorded in the Source Registry:

SourceRecord {
  source_id:           string       // UUID
  title:               string       
  uri:                 string       // Original document location
  document_hash:       string       // SHA-256 of document content
  ingested_at:         ISO 8601
  fact_count:          integer      // Number of facts extracted
  status:              enum         // ACTIVE | UPDATED | REMOVED
}

When a source document is updated:

1. New version is ingested alongside old version
2. Old facts are marked STALE
3. New facts reference the same source_id with updated content_hash
4. DPE is notified of source changes for cross-window coherence validation

5. Community Detection (Leiden Algorithm)

5.1 Purpose

The Leiden algorithm clusters semantically related facts into communities. Communities enable:

- Targeted retrieval expansion (CRP-SPEC-003 §5.2 Step 1.3)
- Knowledge domain identification (emitted as CRP-Memory-CKF-Community)
- Diversity scoring in the ranking phase

5.2 Graph Construction

Before running Leiden, a similarity graph is constructed:

1. For each fact, find K nearest neighbours in the HNSW index (default K=20)
2. Create edges between facts with cosine similarity ≥ 0.60
3. Edge weight = cosine similarity value
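
The graph construction can be sketched on a toy scale as below. An exhaustive cosine search stands in for the HNSW nearest-neighbour query, and the function names and edge representation (a dict keyed by sorted fact-ID pairs) are assumptions for illustration rather than part of the specification.

```python
import math

K_NEIGHBOURS = 20
MIN_SIMILARITY = 0.60


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_similarity_graph(
    embeddings: dict[str, list[float]],
    k: int = K_NEIGHBOURS,
    threshold: float = MIN_SIMILARITY,
) -> dict[tuple, float]:
    """Return undirected weighted edges keyed by a sorted (fact_id, fact_id) pair."""
    edges = {}
    for fact_id, vector in embeddings.items():
        # Exhaustive search stands in for the HNSW k-nearest-neighbour query.
        scored = sorted(
            (
                (cosine(vector, other), other_id)
                for other_id, other in embeddings.items()
                if other_id != fact_id
            ),
            reverse=True,
        )
        for similarity, other_id in scored[:k]:
            if similarity >= threshold:
                # Edge weight is the cosine similarity itself.
                edges[tuple(sorted((fact_id, other_id)))] = similarity
    return edges
```

The resulting edge map is the kind of weighted graph a Leiden implementation would take as input, using the parameters in §5.3.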

5.3 Leiden Parameters

Parameter	Default	Description
`resolution`	1.0	Controls community granularity (higher = more communities)
`n_iterations`	10	Optimisation iterations
`min_community_size`	3	Minimum facts per community

5.4 Community Labels

Each community is assigned a human-readable label by sampling the 5 highest-importance facts and extracting the dominant topic via keyword extraction.

5.5 Reclustering Schedule

Communities are recomputed:

- After every ingestion batch of ≥50 new facts
- On a scheduled basis (default: weekly)
- When explicitly triggered by the operator

Reclustering changes the CKF state hash, which invalidates all client ETags.
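
The three triggers reduce to a single predicate. The sketch below assumes the caller tracks the size of the latest ingestion batch and the time of the last reclustering run; `should_recluster` and its parameters are hypothetical names.

```python
from datetime import datetime, timedelta

BATCH_THRESHOLD = 50                     # new facts in one ingestion batch
SCHEDULE_INTERVAL = timedelta(weeks=1)   # default schedule: weekly


def should_recluster(
    new_facts_in_batch: int,
    last_run: datetime,
    now: datetime,
    operator_triggered: bool = False,
) -> bool:
    """True if any of the reclustering triggers applies."""
    return (
        operator_triggered
        or new_facts_in_batch >= BATCH_THRESHOLD
        or now - last_run >= SCHEDULE_INTERVAL
    )
```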

6. CKF State Hash (ETag Source)

6.1 Computation

state_components = SORT([
  f.fact_id + ":" + f.content_hash + ":" + f.status
  for f in all_active_facts
])
ckf_state_hash = SHA-256("|".join(state_components))
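
A runnable rendering of the pseudocode above, as a minimal sketch. The `Fact` dataclass, UTF-8 encoding, and lowercase hex output are assumptions not stated in the formula. The sketch covers only the three fields the formula names; it does not model how community labels or the embedding model feed into the hash.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    fact_id: str
    content_hash: str
    status: str


def ckf_state_hash(active_facts: list[Fact]) -> str:
    # Sorting makes the digest independent of storage or iteration order.
    components = sorted(
        f"{f.fact_id}:{f.content_hash}:{f.status}" for f in active_facts
    )
    # Assumptions: UTF-8 encoding of the joined string and a lowercase hex digest.
    return hashlib.sha256("|".join(components).encode("utf-8")).hexdigest()
```

Sorting before joining is what lets two replicas that hold the same facts in a different order emit the same ETag.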

6.2 Change Detection

The state hash changes when:

- Any fact is added, modified, or deleted
- Any fact's status changes
- Community reclustering occurs (labels change)

It does NOT change when:

- Facts are accessed (read-only)
- Access counts are updated
- No structural changes occur

6.3 ETag Emission

The CKF state hash is the source for CRP-Context-ETag (CRP-SPEC-002 §4.8).

7. Cache-Control Semantics for CKF

7.1 CRP-Context-Cache Directives (CKF-Specific)

Directive	CKF Behaviour
`reuse-ckf`	Read from CKF but do not trigger source re-ingestion
`no-store`	Do not write this session's facts to CKF. Read is still permitted.
`no-cache`	Ignore cached envelope; force full Phase 1 retrieval from CKF
`only-if-ckf`	Fail with 424 if CKF has no facts with relevance ≥ 0.50 for this query
`max-age=N`	Facts with `ingested_at` older than N seconds are treated as STALE for ranking
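
The `max-age` and `only-if-ckf` rows can be sketched as below. The comma-separated header syntax is an assumption borrowed from HTTP Cache-Control conventions and is not confirmed by this document; `parse_directives`, `stale_for_ranking`, `enforce_only_if_ckf`, and `NoCkfKnowledgeError` are hypothetical names.

```python
from datetime import datetime, timedelta
from typing import Optional

MIN_RELEVANCE = 0.50


class NoCkfKnowledgeError(Exception):
    """Hypothetical error type carrying the 424 status for only-if-ckf."""

    status_code = 424


def parse_directives(header_value: str) -> dict[str, Optional[str]]:
    """Parse e.g. 'reuse-ckf, max-age=3600' into a directive map.

    Assumption: directives are comma-separated, with optional '=value'.
    """
    directives = {}
    for part in header_value.split(","):
        name, _, value = part.strip().partition("=")
        if name:
            directives[name.lower()] = value or None
    return directives


def stale_for_ranking(ingested_at: datetime, max_age_seconds: int, now: datetime) -> bool:
    """max-age=N: facts ingested more than N seconds ago rank as STALE."""
    return now - ingested_at > timedelta(seconds=max_age_seconds)


def enforce_only_if_ckf(relevance_scores: list[float]) -> None:
    """only-if-ckf: fail with 424 unless some fact has relevance >= 0.50."""
    if not any(score >= MIN_RELEVANCE for score in relevance_scores):
        raise NoCkfKnowledgeError("no CKF facts with relevance >= 0.50")
```

Note that `stale_for_ranking` only affects ranking for the current request; it does not change the stored `status` of the fact.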

When CRP-Context-Cache: no-store is set:

- Session-generated facts are held in Tier 1 (hot session cache) only
- On session end, all Tier 1 facts for this session are purged
- No CKF graph mutations occur
- The CKF state hash does not change

When CRP-Compliance-Data-Residency is set:

- CKF reads are restricted to facts stored in the specified region
- CKF writes are directed to the specified region's storage

8. Fact Lifecycle Management

8.1 TTL-Based Staleness

Facts with a ttl field are automatically marked STALE when current_time > ingested_at + ttl. Stale facts receive a freshness penalty in ranking but are not deleted.
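
The expiry rule `current_time > ingested_at + ttl` can be sketched as follows. Python's standard library has no ISO 8601 duration parser, so this example handles only the week, day, and time designators (e.g., `P90D`, `P2W`, `PT12H`). Year and month designators are rejected because they need calendar arithmetic. The function names are illustrative assumptions.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

# Assumption: only the week/day/time subset of ISO 8601 durations is supported.
_DURATION = re.compile(
    r"^P(?:(?P<weeks>\d+)W)?(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_ttl(ttl: str) -> timedelta:
    """Convert a supported ISO 8601 duration string into a timedelta."""
    match = _DURATION.match(ttl)
    if not match or not any(match.groupdict().values()):
        raise ValueError(f"unsupported ISO 8601 duration: {ttl!r}")
    parts = {name: int(value) for name, value in match.groupdict().items() if value}
    return timedelta(**parts)


def is_stale(ingested_at: datetime, ttl: Optional[str], now: datetime) -> bool:
    """Apply current_time > ingested_at + ttl; facts without a ttl never expire this way."""
    if ttl is None:
        return False
    return now > ingested_at + parse_ttl(ttl)
```

A periodic sweep could call `is_stale` for each ACTIVE fact and move expired ones to STALE, which in turn changes the CKF state hash.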

8.2 Access-Based Eviction

Facts not accessed within a configurable period (default: 180 days) MAY be archived to cold storage (reducing HNSW index size). Archived facts can be restored on demand.

8.3 GDPR Erasure

When a customer requests erasure (GDPR Art. 17):

1. All facts with the specified source_id are marked DELETED
2. The HNSW index excludes deleted facts from search results immediately
3. Deleted facts are physically removed from storage within 30 days
4. The CKF state hash is recomputed, which invalidates all ETags
5. The audit trail records the erasure event (the fact that erasure occurred is retained; the erased content is not)

8.4 Fact Quarantine

If the DPE detects that a specific CKF fact consistently produces fabrications or distortions when included in envelopes (tracked via the audit trail):

1. The fact is marked QUARANTINED
2. Quarantined facts are excluded from retrieval
3. The operator is notified to review the source document
4. Quarantine can be lifted manually after review

9. Multi-Tenant CKF Isolation

9.1 Tenant Isolation Model

In multi-tenant CRP Gateway deployments:

- Each tenant has a separate CKF namespace
- HNSW indexes are per-tenant (no cross-tenant vector search)
- Community detection runs per-tenant
- CKF state hashes are per-tenant
- Cross-tenant fact access is architecturally impossible

9.2 Shared Knowledge Bases

Operators MAY configure shared CKF namespaces accessible by multiple tenants (e.g., regulatory knowledge). Shared namespaces:

- Are read-only for tenants
- Are managed by the platform operator
- Have their own state hash (included in the combined ETag computation)

10. References

CRP-SPEC-001 - Core Protocol Specification
CRP-SPEC-003 - Context Envelope & Packing
CRP-SPEC-015 - Security & Privacy
Malkov, Y. and Yashunin, D. (2018). "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs"
Traag, V.A., Waltman, L. and van Eck, N.J. (2019). "From Louvain to Leiden: guaranteeing well-connected communities"