# Extraction Pipeline
CRP's 6-stage graduated extraction pipeline converts raw LLM output into structured, atomic facts. Each stage adds analytical depth at increasing computational cost.
## Pipeline Overview
```mermaid
graph LR
    A[Raw Text] --> B[Stage 1<br/>Regex<br/>~1ms]
    B --> C[Stage 2<br/>Statistical<br/>~5ms]
    C --> D[Stage 3<br/>GLiNER NER<br/>~50ms]
    D --> E[Stage 4<br/>UIE Relations<br/>~100ms]
    E --> F[Stage 5<br/>Discourse<br/>~150ms]
    F --> G[Stage 6<br/>LLM-Assisted<br/>1 LLM call]
    G --> H[Fact Graph]
```
## Stages
### Stage 1 — Regex Extraction (~1 ms)
Pattern-based extraction of structured data:
- Key-value pairs (`key: value`)
- Definitions (`X is defined as Y`)
- Numeric facts, dates, URLs
- Code blocks and inline code
- List items and enumerations
Always runs. Near-zero cost.
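A stage like this amounts to a bank of compiled patterns run over the text. The patterns below are illustrative, not CRP's actual rule set:

```python
import re

# Illustrative patterns only; CRP's actual rule set is richer.
KV_RE = re.compile(r"^(?P<key>[\w .-]+?):\s+(?P<value>.+)$", re.MULTILINE)
DEF_RE = re.compile(r"(?P<term>[A-Z][\w-]*) is defined as (?P<definition>[^.]+)\.")

def regex_extract(text: str) -> list[dict]:
    """Cheap first pass: pull key-value pairs and definitions."""
    facts = [{"type": "key_value", "key": m["key"].strip(), "value": m["value"].strip()}
             for m in KV_RE.finditer(text)]
    facts += [{"type": "definition", "term": m["term"], "definition": m["definition"]}
              for m in DEF_RE.finditer(text)]
    return facts

facts = regex_extract("Region: us-east-1\nEtcd is defined as a distributed key-value store.")
```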
### Stage 2 — Statistical NLP (~5 ms)
TextRank-based extraction of important sentences:
- Sentence importance scoring
- Noun phrase extraction
- Term frequency analysis
- Co-occurrence patterns
Always runs. Catches facts that regex misses.
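Genuine TextRank builds a sentence-similarity graph and ranks it PageRank-style; the sketch below uses a much cruder term-frequency proxy just to show the shape of sentence scoring:

```python
import re
from collections import Counter

def score_sentences(text: str) -> list[tuple[float, str]]:
    """Rank sentences by average term frequency (a crude proxy for TextRank)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z]+", text.lower()))
    scored = []
    for s in sentences:
        terms = re.findall(r"[a-z]+", s.lower())
        scored.append((sum(freq[t] for t in terms) / max(len(terms), 1), s))
    return sorted(scored, reverse=True)  # highest-scoring sentences first

ranked = score_sentences("Etcd stores state. Etcd is fast. The weather is nice.")
```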
### Stage 3 — GLiNER Named Entity Recognition (~50 ms)
Neural NER using GLiNER models:
- Person, organization, location entities
- Technical terms, software names
- Domain-specific entities
- Entity linking and deduplication
Runs when Stage 2 yield is low.
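Running GLiNER itself requires model weights, so the sketch below covers only the last bullet, deduplicating entity mentions by normalized surface form (field names are assumptions, not CRP's schema):

```python
def dedupe_entities(mentions: list[dict]) -> list[dict]:
    """Merge mentions whose (label, lowercased text) coincide, counting occurrences."""
    merged: dict[tuple[str, str], dict] = {}
    for m in mentions:
        key = (m["label"], m["text"].strip().lower())
        if key in merged:
            merged[key]["mentions"] += 1
        else:
            merged[key] = {**m, "mentions": 1}
    return list(merged.values())

entities = dedupe_entities([
    {"text": "Kubernetes", "label": "software"},
    {"text": "kubernetes", "label": "software"},
    {"text": "etcd", "label": "software"},
])
```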
### Stage 4 — UIE Relational Extraction (~100 ms)
Universal Information Extraction for relationships:
- Subject-predicate-object triples
- Cause-effect relationships
- Dependency chains
- Temporal sequences
Runs when Stage 3 yield is low.
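Real UIE relies on a trained model; the sketch below only shows the target triple shape, with a toy string split standing in for the extractor:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str

# Toy predicate list standing in for a learned extractor.
PREDICATES = (" uses ", " requires ", " causes ")

def naive_triple(sentence: str) -> Optional[Triple]:
    for pred in PREDICATES:
        if pred in sentence:
            subj, obj = sentence.split(pred, 1)
            return Triple(subj.strip(), pred.strip(), obj.strip().rstrip("."))
    return None

t = naive_triple("Kubernetes uses etcd for state storage.")
```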
### Stage 5 — Discourse Structure (~150 ms)
Identifies document-level patterns:
- Argument structure
- Rhetorical relations
- Topic boundaries
- Logical flow
Runs for reasoning-dense content.
### Stage 6 — LLM-Assisted Relational (1 LLM call)
Uses the LLM itself to extract complex relationships:
- Multi-hop reasoning chains
- Implicit relationships
- Domain-specific ontology mapping
Off by default. Enable via configuration.
## Content Complexity Routing
CRP automatically detects content type and selects appropriate stages:
| Content Type | Stages Used | Example |
|---|---|---|
| ENTITY_RICH | 1 → 4 | API documentation, reference material |
| REASONING_DENSE | 1 → 6 | Research papers, analysis |
| NARRATIVE | 1 → 5 | Reports, articles, guides |
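A crude stand-in for the detector might route on surface signals such as reasoning markers and capitalization density. The thresholds below are invented for illustration; CRP's real detector is not specified here:

```python
def detect_complexity(text: str) -> str:
    """Heuristic routing sketch; thresholds are illustrative, not CRP's values."""
    words = text.lower().split()
    if not words:
        return "NARRATIVE"
    markers = {"because", "therefore", "hence", "thus", "implies"}
    marker_rate = sum(w.strip(".,;") in markers for w in words) / len(words)
    caps_rate = sum(w[0].isupper() for w in text.split()) / len(words)
    if marker_rate > 0.02:
        return "REASONING_DENSE"   # argumentative prose: run all six stages
    if caps_rate > 0.3:
        return "ENTITY_RICH"       # name-heavy reference text: stages 1-4
    return "NARRATIVE"             # default: stages 1-5
```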
## Quality Gate
Every extracted fact passes a 3-tier quality gate:
1. Structural validation — Well-formed, complete, no truncation
2. Confidence scoring — Statistical confidence meets threshold
3. Anomaly detection — Outliers flagged for review
Facts that fail validation are discarded or demoted.
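A sketch of such a gate, applying the tiers in order. The threshold value and the discard/demote/flag verdicts are illustrative assumptions:

```python
def quality_gate(fact: dict, confidence_floor: float = 0.6) -> str:
    """Apply the three tiers in order; thresholds and verdicts are illustrative."""
    text = fact.get("text", "")
    # Tier 1 - structural validation: non-empty and not visibly truncated.
    if not text or text.endswith(("...", "…")):
        return "discard"
    # Tier 2 - confidence scoring against a floor.
    if fact.get("confidence", 0.0) < confidence_floor:
        return "demote"
    # Tier 3 - anomaly detection: flag outliers, e.g. suspiciously long facts.
    if len(text) > 2000:
        return "flag"
    return "accept"
```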
## Fact Graph
Extracted facts are stored as nodes in a typed graph:
```mermaid
graph TD
    A["Kubernetes uses etcd<br/>for state storage"] -->|depends_on| B["etcd is a distributed<br/>key-value store"]
    A -->|cause_effect| C["Pod state is persisted<br/>across restarts"]
    B -->|condition| D["Requires odd number<br/>of cluster members"]
```
### Edge Types
| Type | Meaning |
|---|---|
| `depends_on` | Fact A requires Fact B |
| `cause_effect` | Fact A causes Fact B |
| `condition` | Fact A is conditional on Fact B |
| `contradicts` | Fact A conflicts with Fact B |
| `supersedes` | Fact A replaces Fact B |
| `elaborates` | Fact A adds detail to Fact B |
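The edge taxonomy above can be modeled as a small typed-graph structure. This sketch assumes nothing about CRP's storage layer:

```python
from dataclasses import dataclass, field
from enum import Enum

class EdgeType(Enum):
    DEPENDS_ON = "depends_on"
    CAUSE_EFFECT = "cause_effect"
    CONDITION = "condition"
    CONTRADICTS = "contradicts"
    SUPERSEDES = "supersedes"
    ELABORATES = "elaborates"

@dataclass
class FactGraph:
    edges: list[tuple[str, EdgeType, str]] = field(default_factory=list)

    def link(self, a: str, edge: EdgeType, b: str) -> None:
        self.edges.append((a, edge, b))

    def neighbors(self, fact_id: str, edge: EdgeType) -> list[str]:
        return [b for a, e, b in self.edges if a == fact_id and e == edge]

g = FactGraph()
g.link("f1", EdgeType.DEPENDS_ON, "f2")
g.link("f1", EdgeType.CAUSE_EFFECT, "f3")
```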
## Event-Sourced Fact Model
Facts use an append-only event log — nothing is deleted, only superseded:
- Full temporal query support (what did we know at window N?)
- Complete audit trail for compliance
- State reconstruction from any point
- Automatic snapshots every 50 windows
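The append-only model boils down to replaying the log up to a window. Operation names and fields here are illustrative, and the log is assumed sorted by window:

```python
def state_at(events: list[dict], window: int) -> dict[str, str]:
    """Replay the log up to (and including) a window; superseded facts stay in the log."""
    state: dict[str, str] = {}
    for ev in events:                      # events assumed sorted by window
        if ev["window"] > window:
            break
        state[ev["fact_id"]] = ev["text"]  # assert and supersede both set the latest text
    return state

log = [
    {"window": 1, "op": "assert", "fact_id": "f1", "text": "etcd v3.4 in use"},
    {"window": 5, "op": "supersede", "fact_id": "f1", "text": "etcd v3.5 in use"},
]
```

Because nothing is deleted, "what did we know at window N?" is just a replay with a different cutoff.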
## Singleton Model Registry
CRP shares model instances across subsystems:
- One `all-MiniLM-L6-v2` instance (80 MB) shared by envelope builder, CKF, and extraction
- Process-wide singleton — no duplicate loading
- Lazy initialization on first use
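A minimal sketch of such a lazy, process-wide registry; the loader callback is a stand-in for real model loading:

```python
import threading

class ModelRegistry:
    """Process-wide registry: each named model is loaded at most once, lazily."""
    _lock = threading.Lock()
    _models: dict[str, object] = {}

    @classmethod
    def get(cls, name: str, loader):
        with cls._lock:
            if name not in cls._models:
                cls._models[name] = loader()   # first use pays the load cost
            return cls._models[name]

loads = []
def fake_loader():
    loads.append(1)        # record how many times loading actually happens
    return object()

a = ModelRegistry.get("all-MiniLM-L6-v2", fake_loader)
b = ModelRegistry.get("all-MiniLM-L6-v2", fake_loader)
```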
## API
```python
from crp.extraction import ExtractionPipeline, detect_content_complexity

# The pipeline runs automatically during dispatch.
# For manual use:
result = client.ingest(
    raw_text="Kubernetes is an open-source container orchestration...",
    source_label="k8s-docs",
)
print(f"Facts extracted: {result.facts_extracted}")
print(f"Fact IDs: {result.fact_ids}")
```