Design Document: Neo4j GraphRAG (Parent Vision Spec)
Overview
This parent design defines the high-level architecture and shared contracts for the Neo4j GraphRAG system, an opt-in graph+vector search engine that enriches the existing Postgres-based semantic search pipeline. Neo4j operates as a completely independent feature set. When NEO4J_URI is set, the entire graph pipeline activates (entity extraction, relationship extraction, direct Neo4j writes). When it is not set, none of these steps run and no graph data is stored anywhere, including Postgres. Neo4j is the only place graph data lives. The existing BM25+Vector ensemble search continues unchanged in either case.
The system is model-agnostic: BERT (default, free, local) handles entity extraction and embeddings; any OpenAI-compatible LLM (Ollama, LM Studio, OpenAI, Azure) handles relationship extraction. Every component degrades gracefully when its dependencies are unavailable.
This design establishes a two-tier contract system:
Parent contracts — shared Zod schemas, API interfaces, Neo4j node/relationship shapes, env var conventions that ALL sub-feature specs must conform to
Sub-feature contracts — defined later in each sub-spec for workstream-level boundaries
Architecture
System Diagram
graph TB
subgraph "Client"
UI[Web App]
end
subgraph "Next.js App"
API[API Routes]
Ingestion[Ingestion Pipeline<br/>Steps A-G+]
Ensemble[Ensemble Search<br/>BM25 + Vector + Graph]
Reranker[Reranker Client]
end
subgraph "Data Stores"
PG[(PostgreSQL + pgvector<br/>Source of Truth)]
Neo4j[(Neo4j CE<br/>Graph + Vector<br/>Optional)]
end
subgraph "Sidecar (FastAPI)"
Embed["/embed<br/>BERT Embeddings"]
NER["/extract-entities<br/>BERT NER + CLS Embeddings"]
RelEx["/extract-relationships<br/>Model-Agnostic LLM"]
RerankModel["/rerank<br/>Cross-Encoder"]
end
subgraph "External (Optional)"
Ollama[Ollama / LM Studio / OpenAI<br/>Any OpenAI-Compatible LLM]
EmbAPI[External Embedding API<br/>Optional]
end
UI --> API
API --> Ingestion
API --> Ensemble
Ingestion -->|Steps A-E: always runs| PG
Ingestion -->|"Step F: BERT NER (only when NEO4J_URI set)"| NER
Ingestion -->|"Step F2: Relationship Extraction (only when NEO4J_URI set)"| RelEx
Ingestion -->|"Step G: Direct Write (only when NEO4J_URI set)"| Neo4j
Ensemble -->|BM25 + Vector| PG
Ensemble -->|"Graph Retriever (only when NEO4J_URI set)"| Neo4j
Ensemble --> Reranker --> RerankModel
RelEx -->|OpenAI-compatible API| Ollama
NER -->|Local BERT model| NER
Neo4j -.->|Section content lookup| PG
style Neo4j fill:#e1f5fe,stroke:#0288d1
style PG fill:#e8f5e9,stroke:#388e3c
style Ollama fill:#fff3e0,stroke:#f57c00
Data Flow: Ingestion Pipeline
flowchart LR
A[Upload] --> B[OCR/Normalize]
B --> C[Chunk]
C --> D[Embed via OpenAI or Sidecar]
D --> E[Store Sections in Postgres]
E --> CHECK{"NEO4J_URI set?"}
CHECK -->|No| DONE[Pipeline Complete<br/>Postgres-only, no graph data anywhere]
CHECK -->|Yes| F["Step F: BERT NER → Entities<br/>(+ CLS Embeddings)"]
F --> F2["Step F2: LLM Relationship Extraction<br/>(gated: EXTRACTION_LLM_BASE_URL)"]
F2 --> G["Step G: Direct Neo4j Write<br/>Entities + Relationships + Sections"]
style G fill:#e1f5fe,stroke:#0288d1
style F fill:#e1f5fe,stroke:#0288d1
style F2 fill:#fff3e0,stroke:#f57c00
style DONE fill:#e8f5e9,stroke:#388e3c
Key architectural decision: NEO4J_URI gates the entire graph feature set. When NEO4J_URI is not set, Steps F, F2, and G do not run at all: no entity extraction, no relationship extraction, and no graph data stored anywhere, including the Postgres kg_* tables. The graph pipeline simply does not exist for users who don't enable Neo4j. When NEO4J_URI is set, entities and relationships are extracted and written directly to Neo4j, which is the only place graph data lives. The Postgres kg_* tables are a legacy artifact from the old architecture and are not used by this feature.
Data Flow: Query Pipeline
flowchart LR
Q[User Query] --> E1[BERT NER on Query<br/>Extract Entities + Embedding]
E1 --> ES{Ensemble Search}
ES --> BM25[BM25 Retriever<br/>Postgres]
ES --> Vec[Vector Retriever<br/>pgvector]
ES --> GR["Graph Retriever<br/>Neo4j (when enabled)"]
BM25 --> RRF[Reciprocal Rank Fusion]
Vec --> RRF
GR --> RRF
RRF --> Rerank[Sidecar Cross-Encoder Rerank]
Rerank --> Answer[LLM Answer Generation]
style GR fill:#e1f5fe,stroke:#0288d1
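Feature Packet Dependency Graph
graph TD
R1[Req 1: Neo4j Docker Infrastructure] --> R5
R2[Req 2: Embedding Provider Abstraction] --> R6
R3[Req 3: BERT Entity CLS Embeddings] --> R5
R4[Req 4: Relationship Extraction Endpoint] --> R5
R5[Req 5: Direct Neo4j Write Pipeline] --> R6
R5 --> R7
R5 --> R9
R6[Req 6: Vector Index + Entity Resolution] --> R7
R6 --> R10
R7[Req 7: Enhanced Graph Retriever] --> R8
R7 --> R10
R8[Req 8: Modular Ensemble Search]
R9[Req 9: Document Content Graph] --> R10
R10[Req 10: Graph-Guided Hybrid Retrieval] --> R11
R10 --> R12
R11[Req 11: Community Detection]
R12[Req 12: Text2Cypher]
R13[Req 13: Postgres Independence<br/>Cross-Cutting] -.-> R5
R13 -.-> R8
R14[Req 14: Graceful Degradation<br/>Cross-Cutting] -.-> R5
R14 -.-> R7
R14 -.-> R8
style R1 fill:#c8e6c9
style R2 fill:#c8e6c9
style R3 fill:#c8e6c9
style R4 fill:#c8e6c9
style R13 fill:#ffecb3
style R14 fill:#ffecb3
classDef foundation fill:#c8e6c9,stroke:#388e3c
classDef crosscut fill:#ffecb3,stroke:#f57c00
Critical path: R1 → R5 → R6 → R7 → R10 → R11/R12
Parallelizable foundations (no dependencies): R1, R2, R3, R4 can all be built simultaneously.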
Components and Interfaces
1. Sidecar API Contracts
POST /extract-entities (Enhanced)
Existing endpoint enhanced with optional CLS embeddings. Backward compatible.
// Request (unchanged)
interface ExtractEntitiesRequest {
  chunks: string[];
}

// Response when called WITHOUT ?include_embeddings=true (unchanged)
interface ExtractEntitiesResponse {
  results: {
    text: string;
    entities: { text: string; label: string; score: number }[];
  }[];
  total_entities: number;
}

// Response when called WITH ?include_embeddings=true (enhanced)
interface ExtractEntitiesEnhancedResponse {
  results: {
    text: string;
    entities: {
      text: string;
      label: string;
      score: number;
      embedding: number[]; // 768-dim BERT CLS vector
    }[];
  }[];
  total_entities: number;
}
POST /extract-relationships (New)
Model-agnostic relationship extraction via any OpenAI-compatible LLM.
interface ExtractRelationshipsRequest {
  chunks: string[];
  known_entities?: string[]; // Optional: constrain relationship targets
}

interface ExtractionEntity {
  name: string;
  type: "PERSON" | "ORGANIZATION" | "LOCATION" | "PRODUCT" | "EVENT" | "OTHER";
}

interface ExtractionRelationship {
  source: string; // Must match an entity name
  target: string; // Must match an entity name
  type: string;   // SCREAMING_SNAKE_CASE: ^[A-Z][A-Z0-9_]*$
  detail: string; // Brief evidence description
}

interface ExtractionChunkResult {
  text: string;
  entities: ExtractionEntity[];
  relationships: ExtractionRelationship[];
  dropped_relationships: ExtractionRelationship[]; // Invalid source/target
}

interface ExtractRelationshipsResponse {
  results: ExtractionChunkResult[];
  total_entities: number;
  total_relationships: number;
  total_dropped: number;
}
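The split between relationships and dropped_relationships can be sketched as a small pure function. This is a minimal illustration, not the Sidecar's actual code: the helper name partitionRelationships is hypothetical, exact case-sensitive name matching is an assumption, and dropping on an invalid type (in addition to an invalid source or target) is also an assumption.

```typescript
// Hypothetical helper: partition LLM output into valid and dropped relationships.
interface ExtractionEntity {
  name: string;
  type: "PERSON" | "ORGANIZATION" | "LOCATION" | "PRODUCT" | "EVENT" | "OTHER";
}

interface ExtractionRelationship {
  source: string;
  target: string;
  type: string;
  detail: string;
}

// SCREAMING_SNAKE_CASE pattern taken from the contract above.
const RELATIONSHIP_TYPE_PATTERN = /^[A-Z][A-Z0-9_]*$/;

function partitionRelationships(
  entities: ExtractionEntity[],
  candidates: ExtractionRelationship[],
): { relationships: ExtractionRelationship[]; dropped: ExtractionRelationship[] } {
  const names = entities.map((entity) => entity.name);
  const relationships: ExtractionRelationship[] = [];
  const dropped: ExtractionRelationship[] = [];

  for (const rel of candidates) {
    // Assumption: names are compared exactly (case-sensitive, no normalization).
    const endpointsKnown =
      names.indexOf(rel.source) !== -1 && names.indexOf(rel.target) !== -1;
    // Assumption: a malformed type is dropped just like an unknown endpoint.
    const typeValid = RELATIONSHIP_TYPE_PATTERN.test(rel.type);

    if (endpointsKnown && typeValid) {
      relationships.push(rel);
    } else {
      dropped.push(rel);
    }
  }
  return { relationships, dropped };
}
```

Keeping the rejected items rather than discarding them is what lets the response report total_dropped.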
2. Embedding Provider Interface
/** Model-agnostic embedding provider; all sub-features use this interface */
interface EmbeddingProvider {
  /** Embed a single text string */
  embed(text: string): Promise<number[]>;
  /** Embed multiple texts in a batch */
  embedBatch(texts: string[]): Promise<number[][]>;
  /** The dimensionality of output vectors */
  readonly dimensions: number;
  /** The provider name for logging */
  readonly providerName: string;
}

/** Factory: reads env vars, returns the appropriate provider */
declare function createEmbeddingProvider(): EmbeddingProvider;
// - No EMBEDDING_PROVIDER or EMBEDDING_PROVIDER=bert → BertEmbeddingProvider (768-dim, calls Sidecar /embed)
// - EMBEDDING_PROVIDER=openai-compatible → OpenAICompatibleEmbeddingProvider (configurable dim)
// - Fallback: if external API unreachable → BertEmbeddingProvider with warning log
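The selection rules listed for the factory can be illustrated with a pure configuration resolver. This is a sketch under stated assumptions: resolveEmbeddingConfig and EmbeddingConfig are hypothetical names, the environment is passed in as a plain object, and treating an incomplete external configuration as a reason to fall back to BERT is an assumption. The runtime fallback for an unreachable external API is not modelled here.

```typescript
type EmbeddingProviderName = "bert" | "openai-compatible";

// Hypothetical shape: what a factory would need to construct a provider.
interface EmbeddingConfig {
  provider: EmbeddingProviderName;
  dimensions: number;
  apiUrl?: string;
  model?: string;
}

const BERT_DIMENSIONS = 768; // BERT via Sidecar /embed, per the contract above

function resolveEmbeddingConfig(
  env: Record<string, string | undefined>,
): EmbeddingConfig {
  const bert: EmbeddingConfig = { provider: "bert", dimensions: BERT_DIMENSIONS };

  // Unset or "bert" selects the default provider.
  if (env.EMBEDDING_PROVIDER !== "openai-compatible") {
    return bert;
  }

  const dimensions = Number(env.EMBEDDING_DIMENSIONS);
  // Assumption: a missing URL or invalid dimension count falls back to BERT.
  if (!env.EMBEDDING_API_URL || !Number.isInteger(dimensions) || dimensions <= 0) {
    return bert;
  }

  return {
    provider: "openai-compatible",
    dimensions,
    apiUrl: env.EMBEDDING_API_URL,
    model: env.EMBEDDING_MODEL,
  };
}
```

Resolving configuration separately from constructing the provider keeps the selection logic testable without a Sidecar or an external API.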
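3. Neo4j Direct Writer Interface
Data Models
Neo4j Graph Schema
Shared Zod Schemas (Parent-Level Contracts)
These are the runtime-validated contracts that all sub-feature specs must conform to. They are updated from the old gemma-schemas.ts with model-agnostic naming and new node types.
Environment Variable Contract
All Neo4j-related environment variables across all requirements, with validation rules: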
// ═══════════════════════════════════════════════════════════════
// ENVIRONMENT VARIABLE CONTRACT
// ═══════════════════════════════════════════════════════════════
//
// All variables are OPTIONAL: the system works without any of them.
// Each variable gates a specific capability.
//
// ┌─────────────────────────────┬──────────┬─────────────────────────────────────────────────┐
// │ Variable                    │ Required │ Gates                                           │
// ├─────────────────────────────┼──────────┼─────────────────────────────────────────────────┤
// │ NEO4J_URI                   │ No       │ All Neo4j operations (Req 1,5,6,7,8,9)          │
// │ NEO4J_USERNAME              │ No       │ Neo4j auth (default: "neo4j")                   │
// │ NEO4J_PASSWORD              │ No       │ Neo4j auth (default: "password")                │
// │ EXTRACTION_LLM_BASE_URL     │ No       │ Relationship extraction (Req 4,5)               │
// │ EXTRACTION_LLM_MODEL        │ No       │ LLM model selection (Req 4)                     │
// │ EMBEDDING_PROVIDER          │ No       │ Embedding backend: "bert" | "openai-compatible" │
// │ EMBEDDING_API_URL           │ No       │ External embedding API (Req 2)                  │
// │ EMBEDDING_MODEL             │ No       │ External embedding model name (Req 2)           │
// │ EMBEDDING_DIMENSIONS        │ No       │ External embedding dimensions (Req 2)           │
// │ ENABLE_GRAPH_RETRIEVER      │ No       │ Graph retriever in ensemble (Req 7,8)           │
// │ ENTITY_RESOLUTION_THRESHOLD │ No       │ Cosine similarity threshold (default: 0.85)     │
// │ SIDECAR_URL                 │ No       │ All sidecar operations (Req 3,4)                │
// └─────────────────────────────┴──────────┴─────────────────────────────────────────────────┘
//
// GATING LOGIC:
// - NEO4J_URI unset → skip ENTIRE graph pipeline (Steps F, F2, G, G2, G3, G4)
//     No entity extraction, no relationship extraction,
//     no graph data stored anywhere. Pipeline ends at Step E.
// - NEO4J_URI set + EXTRACTION_LLM_BASE_URL unset → skip relationship extraction only,
//     graph has BERT entities with CO_OCCURS relationships
// - NEO4J_URI set + SIDECAR_URL unset → skip entity extraction, no graph data
// - ENABLE_GRAPH_RETRIEVER=false → ensemble uses only BM25+Vector (2-retriever)
// - EMBEDDING_PROVIDER unset → default to BERT via Sidecar (768-dim, free)
//
// MIGRATION FROM OLD SPEC:
// - GEMMA_BASE_URL → EXTRACTION_LLM_BASE_URL (model-agnostic naming)
// - GEMMA_MODEL → EXTRACTION_LLM_MODEL (model-agnostic naming)
// - Old vars remain in src/env.ts for backward compatibility during transition

// Zod validation schema additions for src/env.ts.
// Assumes `z` (zod) and the `optionalString()` helper are already in scope there.
const neo4jGraphRAGEnvSchema = z.object({
  // Neo4j connection (optional; enables graph storage)
  NEO4J_URI: optionalString(),
  NEO4J_USERNAME: optionalString(),
  NEO4J_PASSWORD: optionalString(),

  // Extraction LLM (optional; enables relationship extraction)
  // Model-agnostic: works with Ollama, LM Studio, OpenAI, Azure, any OpenAI-compatible
  EXTRACTION_LLM_BASE_URL: optionalString(),
  EXTRACTION_LLM_MODEL: optionalString(),

  // Embedding provider (optional; defaults to BERT via Sidecar)
  EMBEDDING_PROVIDER: z.enum(["bert", "openai-compatible"]).optional(),
  EMBEDDING_API_URL: optionalString(),
  EMBEDDING_MODEL: optionalString(),
  EMBEDDING_DIMENSIONS: z.coerce.number().int().positive().optional(),

  // Graph retriever toggle
  // Note: any value other than "true" or "1", including unset, is preprocessed to false
  ENABLE_GRAPH_RETRIEVER: z.preprocess(
    (val) => val === "true" || val === "1",
    z.boolean().optional(),
  ),

  // Entity resolution
  ENTITY_RESOLUTION_THRESHOLD: z.coerce.number().min(0).max(1).optional(),

  // Backward compatibility (old spec names, deprecated)
  GEMMA_BASE_URL: optionalString(),
  GEMMA_MODEL: optionalString(),
});
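The gating rules in the comment block can be read as a small decision function. The sketch below is illustrative only: resolveGraphPipelineGates and GraphPipelineGates are hypothetical names, the environment is passed in as a plain object rather than read from src/env.ts, and runtime health checks (Sidecar, Neo4j, LLM) are deliberately left out.

```typescript
// Hypothetical result shape: which graph steps are configured to run.
interface GraphPipelineGates {
  entityExtraction: boolean;       // Step F
  relationshipExtraction: boolean; // Step F2
  neo4jWrite: boolean;             // Step G
}

function resolveGraphPipelineGates(
  env: Record<string, string | undefined>,
): GraphPipelineGates {
  // A variable counts as "set" when it is a non-empty string.
  const isSet = (name: string): boolean => Boolean(env[name]);

  const neo4j = isSet("NEO4J_URI");
  const sidecar = isSet("SIDECAR_URL");

  return {
    // NEO4J_URI unset disables everything; Step F also needs the Sidecar.
    entityExtraction: neo4j && sidecar,
    // Step F2 is served by the Sidecar and additionally needs an extraction LLM.
    relationshipExtraction: neo4j && sidecar && isSet("EXTRACTION_LLM_BASE_URL"),
    // Step G only depends on Neo4j being configured (health is checked at runtime).
    neo4jWrite: neo4j,
  };
}
```

With only NEO4J_URI and SIDECAR_URL set, this yields a graph of BERT entities and CO_OCCURS relationships, which matches the second gating rule.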
Ingestion Pipeline Contract
Step ordering with inputs/outputs and gating logic:
═══════════════════════════════════════════════════════════════
EXISTING STEPS (A-E) — UNCHANGED, NEVER MODIFIED
═══════════════════════════════════════════════════════════════
Step A: Upload
Input: File (PDF, DOCX, etc.)
Output: Raw file stored, ocrJob created
Gate: Always runs
Step B: OCR / Normalize
Input: Raw file
Output: PageContent[] (text per page)
Gate: Always runs
Step C: Chunk
Input: PageContent[]
Output: DocumentChunk[] (parent + child chunks)
Gate: Always runs
Step D: Embed
Input: DocumentChunk[]
Output: VectorizedChunk[] (1536-dim embeddings when OpenAI is used)
Gate: Always runs (uses OpenAI or Sidecar)
Step E: Store Sections
Input: VectorizedChunk[]
Output: StoredSection[] {sectionId, content} in Postgres
Gate: Always runs
═══════════════════════════════════════════════════════════════
NEO4J GRAPH PIPELINE — ALL GATED BY NEO4J_URI
═══════════════════════════════════════════════════════════════
When NEO4J_URI is NOT set, NONE of these steps run. No entity
extraction, no relationship extraction, no graph data stored
anywhere. The pipeline ends at Step E.
Step F: BERT NER Entity Extraction
Input: StoredSection[] (from Step E)
Output: Entities + CO_OCCURS relationships + CLS embeddings
Gate: NEO4J_URI is set AND SIDECAR_URL is set AND sidecar is healthy
Target: Passed directly to Step G (Neo4j) — NOT stored in Postgres
Step F2: LLM Relationship Extraction (NEW)
Input: StoredSection[] (same chunks as Step F)
Output: Typed relationships (CEO_OF, ACQUIRED, etc.)
+ additional entities from LLM
Gate: NEO4J_URI is set AND EXTRACTION_LLM_BASE_URL is set AND sidecar is healthy
Target: Passed directly to Step G (Neo4j) — NOT stored in Postgres
Note: Independent of Step F — if BERT fails, LLM still runs
Step G: Neo4j Direct Write
Input: All entities + relationships from Steps F and F2
(passed directly in memory, NOT read from Postgres)
Output: Neo4jWriteResult
Gate: NEO4J_URI is set AND Neo4j is healthy
Target: Neo4j graph store (direct MERGE queries)
Note: Neo4j is the ONLY place graph data lives.
Postgres kg_* tables are NOT used by this pipeline.
Step G2: Document Content Graph (NEW, Req 9)
Input: Document metadata + Section IDs + extracted topics
Output: Document, Topic nodes + CONTAINS, DISCUSSES, REFERENCES edges
Gate: NEO4J_URI is set AND Neo4j is healthy
Target: Neo4j graph store
Step G3: Entity Resolution (NEW, Req 6)
Input: Newly written entities with embeddings
Output: Merged entities + ALIAS_OF relationships
Gate: NEO4J_URI is set AND entity-embeddings vector index exists
Target: Neo4j graph store
Step G4: Community Detection (NEW, Req 11)
Input: Entity graph in Neo4j
Output: Community assignments + Community summary nodes
Gate: NEO4J_URI is set AND GDS plugin available
Target: Neo4j graph store
═══════════════════════════════════════════════════════════════
FAILURE ISOLATION
═══════════════════════════════════════════════════════════════
- Step F failure → Step F2 still attempts to run, Step G writes whatever data is available
- Step F2 failure → Step G still writes BERT-only data from Step F
- Step G failure → Graph data is lost for this document (nowhere else to store it)
Document is still marked as successfully ingested (Postgres Steps A-E are source of truth)
- Step G2 failure → Entity graph is still valid, just no content graph
- Step G3 failure → Entities exist but may have duplicates
- Step G4 failure → Graph works, just no community summaries
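The failure isolation rules for Steps F, F2, and G can be sketched as follows. This is a simplified, synchronous illustration: the real steps would be asynchronous network calls, and runIsolated, runGraphSteps, and the string-based entity and relationship values are all hypothetical stand-ins.

```typescript
// Outcome of one isolated step: either its value or a fallback plus an error message.
interface StepOutcome<T> {
  ok: boolean;
  value: T;
  error?: string;
}

// Run a step so that a thrown error never propagates to the caller.
function runIsolated<T>(name: string, step: () => T, fallback: T): StepOutcome<T> {
  try {
    return { ok: true, value: step() };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, value: fallback, error: `${name}: ${message}` };
  }
}

function runGraphSteps(steps: {
  extractEntities: () => string[];
  extractRelationships: () => string[];
  writeGraph: (entities: string[], relationships: string[]) => number;
}): { ingested: true; written: number; errors: string[] } {
  // Step F and Step F2 are independent: one failing does not stop the other.
  const f = runIsolated<string[]>("Step F", steps.extractEntities, []);
  const f2 = runIsolated<string[]>("Step F2", steps.extractRelationships, []);
  // Step G writes whatever data is available from F and F2.
  const g = runIsolated<number>("Step G", () => steps.writeGraph(f.value, f2.value), 0);

  const errors: string[] = [];
  [f, f2, g].forEach((outcome) => {
    if (outcome.error !== undefined) errors.push(outcome.error);
  });

  // The document stays ingested regardless: Postgres Steps A-E are the source of truth.
  return { ingested: true, written: g.value, errors };
}
```

The collected errors are meant for logging only; nothing in this sketch surfaces them to the user.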
Ensemble Search Contract
How the graph retriever plugs into the existing ensemble:
// ═══════════════════════════════════════════════════════════════
// ENSEMBLE SEARCH INTEGRATION
// ═══════════════════════════════════════════════════════════════

// Weight configurations
const WEIGHTS_2_RETRIEVER = [0.4, 0.6];      // BM25, Vector (current default)
const WEIGHTS_3_RETRIEVER = [0.3, 0.5, 0.2]; // BM25, Vector, Graph (static default)

// Dynamic weight adjustment (Req 8.4)
// When graph retriever returns results, adjust weights based on match quality:
//   entityMatchCount >= 3 and avgConfidence > 0.7 → graph weight increases to 0.3-0.5
//   entityMatchCount >= 1 otherwise               → graph weight stays at 0.1-0.2
//   entityMatchCount 0                            → graph weight drops to 0, effectively 2-retriever mode
interface GraphRetrieverMetrics {
  entityMatchCount: number;
  avgConfidence: number;
  traversalHops: number;
}

function computeDynamicWeights(metrics: GraphRetrieverMetrics): [number, number, number] {
  if (metrics.entityMatchCount >= 3 && metrics.avgConfidence > 0.7) {
    return [0.15, 0.35, 0.50]; // Strong graph signal
  }
  if (metrics.entityMatchCount >= 1) {
    return [0.30, 0.50, 0.20]; // Moderate graph signal
  }
  return [0.40, 0.60, 0.00]; // No graph signal → 2-retriever fallback
}

// Graceful degradation interface
// The ensemble MUST handle these failure modes without user-visible errors:
//   1. Neo4j unreachable → use 2-retriever weights, log warning
//   2. Graph retriever timeout (>2s) → use 2-retriever weights, log warning
//   3. Graph retriever returns empty → use 2-retriever weights (normal behavior)
//   4. Sidecar unreachable → skip reranking, return RRF-fused results
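To show how the retriever weights could feed into Reciprocal Rank Fusion, here is a minimal weighted RRF sketch. It is not necessarily how the existing ensemble applies its weights: the function name weightedRrf, the use of section IDs as plain strings, and the smoothing constant k = 60 (a commonly used default) are assumptions.

```typescript
// Weighted Reciprocal Rank Fusion over several ranked lists of section IDs.
// Each list contributes weight / (k + rank) for every ID it contains (rank is 1-based).
function weightedRrf(
  rankings: string[][],
  weights: number[],
  k = 60, // Assumption: common RRF smoothing constant
): { id: string; score: number }[] {
  if (rankings.length !== weights.length) {
    throw new Error("one weight per ranking is required");
  }

  const scores = new Map<string, number>();
  rankings.forEach((ranking, retrieverIndex) => {
    ranking.forEach((id, position) => {
      const contribution = weights[retrieverIndex] / (k + position + 1);
      scores.set(id, (scores.get(id) ?? 0) + contribution);
    });
  });

  // Highest fused score first.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}
```

Passing a graph weight of 0, as the no-signal fallback does, makes graph-only results score 0, so the fusion behaves like the 2-retriever configuration for ranking purposes.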
Correctness Properties
A property is a characteristic or behavior that should hold true across all valid executions of a system — essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.
Property 1: Embedding output normalization
For any valid text input and any configured embedding backend (BERT or OpenAI-compatible), the EmbeddingProvider.embed() method SHALL return an array of floats with length equal to provider.dimensions, and EmbeddingProvider.embedBatch() SHALL return arrays all of the same dimensionality.
Validates: Requirements 2.1, 2.4
Property 2: BERT CLS embedding dimensionality
For any text chunk that produces at least one entity, when /extract-entities?include_embeddings=true is called, every entity in the response SHALL have an embedding field that is an array of exactly 768 floating-point numbers.
Validates: Requirements 3.1, 3.3
Property 3: Relationship extraction structural validity
For any extraction response from POST /extract-relationships, every relationship in the relationships array SHALL have source and target values that each match the name of an entity in the same chunk's entities array, AND every relationship type SHALL match the regex ^[A-Z][A-Z0-9_]*$.
Validates: Requirements 4.2, 4.3
Property 4: Think-tag stripping preserves JSON content
For any string containing <think>...</think> tags wrapping arbitrary text, followed by valid JSON content, the think-tag stripping function SHALL produce a string that is valid JSON-parseable.
Validates: Requirement 4.7
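A think-tag stripping function of the kind this property describes might look like the sketch below. The name stripThinkTags is hypothetical, and the sketch assumes well-formed, non-nested tags whose wrapped text does not itself contain a closing tag.

```typescript
// Remove every <think>...</think> block, then trim surrounding whitespace.
// [\s\S] matches across newlines; *? keeps each match to a single block.
function stripThinkTags(raw: string): string {
  return raw.replace(/<think>[\s\S]*?<\/think>/gi, "").trim();
}
```

The output can then be handed to JSON.parse; an unclosed tag is left untouched by this sketch and would surface as a parse error downstream.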
Property 5: Neo4j entity write completeness
For any entity written to Neo4j via the direct writer, the resulting Entity node SHALL contain all required properties (name, displayName, label, confidence, mentionCount, companyId), and when the input entity has a non-null embedding, the node SHALL have an embedding property of length 768. For any entity mention, there SHALL exist a MENTIONED_IN edge from the Entity node to the corresponding Section node.
Validates: Requirements 5.3, 5.5, 6.1
Property 6: Dynamic Cypher relationship types
For any relationship written to Neo4j, the Cypher relationship type SHALL be the actual relationship type string (e.g., CEO_OF, ACQUIRED, CO_OCCURS) and SHALL NOT be a generic RELATES_TO type with a type property.
Validates: Requirement 5.4
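Because the relationship type ends up in the query text itself, a writer would typically validate it before interpolation. The sketch below is an assumption-laden illustration: buildRelationshipMerge is a hypothetical helper, the detail property on the relationship is an assumption, and the MATCH keys (name, companyId) are borrowed from the required Entity properties listed under Property 5.

```typescript
// Same SCREAMING_SNAKE_CASE pattern the extraction contract uses.
const RELATIONSHIP_TYPE_PATTERN = /^[A-Z][A-Z0-9_]*$/;

// Build a Cypher statement whose relationship type is the actual type string.
// All other values stay as query parameters ($source, $target, ...).
function buildRelationshipMerge(type: string): string {
  // Validation guards the interpolation below against Cypher injection.
  if (!RELATIONSHIP_TYPE_PATTERN.test(type)) {
    throw new Error(`invalid relationship type: ${type}`);
  }
  return [
    "MATCH (s:Entity {name: $source, companyId: $companyId})",
    "MATCH (t:Entity {name: $target, companyId: $companyId})",
    `MERGE (s)-[r:${type}]->(t)`,
    "SET r.detail = $detail", // Assumption: evidence is stored on the relationship
  ].join("\n");
}
```

Using MERGE rather than CREATE also supports the idempotence goal: running the same statement twice matches the existing edge instead of adding a second one.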
Property 7: Neo4j write idempotence
For any set of entities and relationships, writing them to Neo4j twice via the direct writer SHALL produce the same graph state (same node count, same edge count, same property values) as writing them once.
Validates: Requirement 5.7
Property 8: Entity resolution merges duplicates
For any two Entity nodes within the same companyId whose embedding cosine similarity exceeds the configured threshold (default 0.85), the entity resolution module SHALL merge them into a single canonical Entity node and create an ALIAS_OF relationship from the alias to the canonical entity.
Validates: Requirements 6.3, 6.4
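The merge criterion in this property reduces to a cosine similarity check scoped by companyId. A minimal sketch follows; EntityCandidate and shouldMerge are hypothetical names, and in practice the comparison would more likely run inside Neo4j against the vector index than in application code.

```typescript
// Cosine similarity of two equal-length vectors; returns 0 for a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical minimal view of an Entity node for resolution purposes.
interface EntityCandidate {
  name: string;
  companyId: string;
  embedding: number[];
}

// Two entities are merge candidates only within the same company and
// only when similarity strictly exceeds the threshold (default 0.85).
function shouldMerge(a: EntityCandidate, b: EntityCandidate, threshold = 0.85): boolean {
  return (
    a.companyId === b.companyId &&
    cosineSimilarity(a.embedding, b.embedding) > threshold
  );
}
```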
Property 9: Graph retriever traverses all relationship types
For any Neo4j graph containing entities connected by multiple relationship types (e.g., CEO_OF, ACQUIRED, CO_OCCURS), the graph retriever SHALL traverse all relationship types when finding connected entities — not just CO_OCCURS.
Validates: Requirement 7.1
Property 10: Dynamic ensemble weight adjustment
For any graph retriever result with entityMatchCount >= 3 and avgConfidence > 0.7, the ensemble SHALL assign a graph weight of at least 0.3. For any result with entityMatchCount == 0, the graph weight SHALL be 0 (effectively 2-retriever mode).
Validates: Requirement 8.4
Property 11: Document content graph structure
For any document ingested with Neo4j enabled, there SHALL exist a Document node with the correct id, name, companyId, and uploadedAt properties, and for each section belonging to that document, there SHALL exist a CONTAINS relationship from the Document node to the Section node.
Validates: Requirements 9.1, 9.2
Property 12: Cross-document and topic linking
For any two documents within the same companyId that share 3 or more entities, there SHALL exist a REFERENCES relationship between their Document nodes with a sharedEntityCount property. For any two Topic nodes within the same companyId with embedding cosine similarity above 0.8, there SHALL exist a RELATED_TO relationship.
Validates: Requirements 9.4, 9.5
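The cross-document rule can be sketched as a pairwise comparison of entity sets. This is illustrative only: findDocumentReferences is a hypothetical helper, all documents are assumed to belong to the same companyId, and the edge direction (from the lexicographically smaller document ID) is an arbitrary choice since the property only says the relationship exists between the two nodes.

```typescript
interface DocumentReference {
  from: string;
  to: string;
  sharedEntityCount: number;
}

// entitiesByDocument maps a document ID to the entity names mentioned in it.
function findDocumentReferences(
  entitiesByDocument: Record<string, string[]>,
  minShared = 3, // Threshold from the property: 3 or more shared entities
): DocumentReference[] {
  const ids = Object.keys(entitiesByDocument).sort();
  const references: DocumentReference[] = [];

  for (let i = 0; i < ids.length; i++) {
    for (let j = i + 1; j < ids.length; j++) {
      const a = entitiesByDocument[ids[i]];
      const b = entitiesByDocument[ids[j]];
      // Count each shared entity once, even if it is mentioned repeatedly.
      const shared = a.filter(
        (name, index) => a.indexOf(name) === index && b.indexOf(name) !== -1,
      );
      if (shared.length >= minShared) {
        references.push({ from: ids[i], to: ids[j], sharedEntityCount: shared.length });
      }
    }
  }
  return references;
}
```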
Property 13: Combined graph+vector scoring
For any retrieval result from the graph-guided hybrid retriever, the result score SHALL incorporate both graph proximity (inversely proportional to hop distance) and vector similarity (cosine similarity to query embedding).
Validates: Requirement 10.3
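One way to combine the two signals is a weighted blend, sketched below. The exact formula is not specified in this document, so everything here is an assumption: the name combinedScore, the proximity term 1 / (1 + hops) (chosen so that a direct match at 0 hops scores 1), and the default blend weight of 0.5.

```typescript
// Blend graph proximity with vector similarity into a single score.
// hopDistance: number of hops from a matched query entity (0 = direct match).
// cosine: cosine similarity between the result and the query embedding.
function combinedScore(hopDistance: number, cosine: number, graphWeight = 0.5): number {
  if (hopDistance < 0 || graphWeight < 0 || graphWeight > 1) {
    throw new Error("hopDistance must be >= 0 and graphWeight within [0, 1]");
  }
  // Assumption: proximity decays as 1 / (1 + hops).
  const proximity = 1 / (1 + hopDistance);
  return graphWeight * proximity + (1 - graphWeight) * cosine;
}
```

With equal vector similarity, a result one hop away outranks a result three hops away, which is the behavior the property asks for.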
Property 14: Community node data shape
For any detected community, there SHALL exist a Community node in Neo4j with id, summary (non-empty string), companyId, and embedding (768-dim float array) properties.
Validates: Requirement 11.3
Property 15: Text2Cypher read-only validation
For any Cypher query generated by the Text2Cypher module, the read-only validator SHALL reject queries containing CREATE, DELETE, SET, MERGE, or DETACH keywords (case-insensitive), preventing accidental graph mutations.
Validates: Requirement 12.4
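A keyword-based validator matching this property could be as small as the sketch below. The name isReadOnlyCypher is hypothetical, and the keyword list mirrors only what the property names; other write clauses such as REMOVE are not covered. The check is deliberately conservative, so a query that merely contains one of the keywords inside a string literal is also rejected.

```typescript
// Keywords named by the property; matched as whole words, case-insensitively.
const WRITE_KEYWORDS = ["CREATE", "DELETE", "SET", "MERGE", "DETACH"];

function isReadOnlyCypher(query: string): boolean {
  // \b word boundaries avoid false hits on identifiers such as "dataset" or "offset".
  const pattern = new RegExp(`\\b(${WRITE_KEYWORDS.join("|")})\\b`, "i");
  return !pattern.test(query);
}
```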
Property 16: Postgres search independence
For any search query executed against the same dataset, the BM25+Vector retrieval results (excluding the graph retriever contribution) SHALL be identical whether Neo4j is enabled or disabled — Neo4j is purely additive and never modifies Postgres search behavior.
Validates: Requirement 13.1
Error Handling
Graceful Degradation Cascade
Every Neo4j feature follows a strict degradation hierarchy. No optional component failure should ever cause a user-visible error or block document ingestion.
Component Health Check Order (at pipeline start):
1. Sidecar (/health) → gates Steps F, F2
2. Neo4j (RETURN 1) → gates Steps G, G2, G3, G4
3. Extraction LLM (/health or first call) → gates Step F2
4. GDS plugin (CALL gds.graph.list()) → gates Step G4
Failure Handling:
┌──────────────────────────┬────────────────────────────────────────────┐
│ Component Down │ Behavior │
├──────────────────────────┼────────────────────────────────────────────┤
│ NEO4J_URI not set │ Entire graph pipeline disabled │
│ │ No entity extraction, no relationships │
│ │ No graph data stored anywhere │
│ │ Pipeline ends at Step E (Postgres only) │
├──────────────────────────┼────────────────────────────────────────────┤
│ Sidecar unreachable │ Skip entity extraction (Step F) │
│ (NEO4J_URI is set) │ No graph data for this document │
│ │ Document still ingested (Steps A-E ok) │
├──────────────────────────┼────────────────────────────────────────────┤
│ Neo4j unreachable │ Skip Neo4j writes (Step G) │
│ (NEO4J_URI is set) │ Entity extraction still runs but data │
│ │ is discarded (nowhere to write it) │
│ │ Ensemble search uses 2-retriever mode │
├──────────────────────────┼────────────────────────────────────────────┤
│ Extraction LLM down │ Skip relationship extraction (Step F2) │
│ │ Graph has BERT entities + CO_OCCURS only │
├──────────────────────────┼────────────────────────────────────────────┤
│ GDS plugin unavailable │ Skip community detection (Step G4) │
│ │ All other graph operations work normally │
├──────────────────────────┼────────────────────────────────────────────┤
│ Vector index missing │ Skip entity resolution (Step G3) │
│ │ Entities may have duplicates │
├──────────────────────────┼────────────────────────────────────────────┤
│ Neo4j down at query time │ Graph retriever returns empty [] │
│ │ Ensemble falls back to BM25+Vector │
│ │ No user-visible error │
├──────────────────────────┼────────────────────────────────────────────┤
│ External embedding API │ Fall back to BERT via Sidecar │
│ unreachable │ Log warning, continue with 768-dim │
└──────────────────────────┴────────────────────────────────────────────┘
Error Logging Convention
All Neo4j-related errors use structured log prefixes for easy filtering:
[Neo4jWriter] — Direct write operations
[Neo4jRetriever] — Query-time graph retrieval
[EntityResolution] — Duplicate entity merging
[ExtractionLLM] — Relationship extraction via LLM
[EmbeddingProvider] — Embedding generation
[CommunityDetect] — Leiden clustering and summaries
[Text2Cypher] — Natural language to Cypher translation
Testing Strategy
Dual Testing Approach
Unit tests: Specific examples, edge cases, error conditions, graceful degradation paths
Property tests: Universal properties across all inputs (fast-check, minimum 100 iterations)
Contract tests: Zod schema validation — runnable today with no infrastructure
Integration tests: Cross-component data flow with mocked boundaries
Parent Contract Test Strategy
These test files validate parent-level contracts. They are runnable immediately with no Neo4j, no Sidecar, no LLM — just Zod schema validation.
Design Document — Neo4j GraphRAG (Parent Vision Spec)
Overview
This parent design defines the high-level architecture and shared contracts for the Neo4j GraphRAG system — an opt-in graph+vector search engine that enriches the existing Postgres-based semantic search pipeline. Neo4j operates as a completely independent feature set: when
NEO4J_URIis set, the entire graph pipeline activates (entity extraction, relationship extraction, direct Neo4j writes); when it is NOT set, none of these steps run and no graph data is stored anywhere — not in Postgres, not anywhere. Neo4j is the ONLY place graph data lives. The existing BM25+Vector ensemble search continues unchanged regardless.The system is model-agnostic: BERT (default, free, local) handles entity extraction and embeddings; any OpenAI-compatible LLM (Ollama, LM Studio, OpenAI, Azure) handles relationship extraction. Every component degrades gracefully when its dependencies are unavailable.
This design establishes a two-tier contract system:
Architecture
System Diagram
graph TB subgraph "Client" UI[Web App] end subgraph "Next.js App" API[API Routes] Ingestion[Ingestion Pipeline<br/>Steps A-G+] Ensemble[Ensemble Search<br/>BM25 + Vector + Graph] Reranker[Reranker Client] end subgraph "Data Stores" PG[(PostgreSQL + pgvector<br/>Source of Truth)] Neo4j[(Neo4j CE<br/>Graph + Vector<br/>Optional)] end subgraph "Sidecar (FastAPI)" Embed["/embed<br/>BERT Embeddings"] NER["/extract-entities<br/>BERT NER + CLS Embeddings"] RelEx["/extract-relationships<br/>Model-Agnostic LLM"] RerankModel["/rerank<br/>Cross-Encoder"] end subgraph "External (Optional)" Ollama[Ollama / LM Studio / OpenAI<br/>Any OpenAI-Compatible LLM] EmbAPI[External Embedding API<br/>Optional] end UI --> API API --> Ingestion API --> Ensemble Ingestion -->|Steps A-E: always runs| PG Ingestion -->|"Step F: BERT NER (only when NEO4J_URI set)"| NER Ingestion -->|"Step F2: Relationship Extraction (only when NEO4J_URI set)"| RelEx Ingestion -->|"Step G: Direct Write (only when NEO4J_URI set)"| Neo4j Ensemble -->|BM25 + Vector| PG Ensemble -->|"Graph Retriever (only when NEO4J_URI set)"| Neo4j Ensemble --> Reranker --> RerankModel RelEx -->|OpenAI-compatible API| Ollama NER -->|Local BERT model| NER Neo4j -.->|Section content lookup| PG style Neo4j fill:#e1f5fe,stroke:#0288d1 style PG fill:#e8f5e9,stroke:#388e3c style Ollama fill:#fff3e0,stroke:#f57c00Data Flow: Ingestion Pipeline
flowchart LR A[Upload] --> B[OCR/Normalize] B --> C[Chunk] C --> D[Embed via OpenAI or Sidecar] D --> E[Store Sections in Postgres] E --> CHECK{"NEO4J_URI set?"} CHECK -->|No| DONE[Pipeline Complete<br/>Postgres-only, no graph data anywhere] CHECK -->|Yes| F["Step F: BERT NER → Entities<br/>(+ CLS Embeddings)"] F --> F2["Step F2: LLM Relationship Extraction<br/>(gated: EXTRACTION_LLM_BASE_URL)"] F2 --> G["Step G: Direct Neo4j Write<br/>Entities + Relationships + Sections"] style G fill:#e1f5fe,stroke:#0288d1 style F fill:#e1f5fe,stroke:#0288d1 style F2 fill:#fff3e0,stroke:#f57c00 style DONE fill:#e8f5e9,stroke:#388e3cKey architectural decision:
NEO4J_URIgates the ENTIRE graph feature set. WhenNEO4J_URIis NOT set, Steps F, F2, and G do not run at all — no entity extraction, no relationship extraction, no graph data stored anywhere (not in Postgreskg_*tables, not anywhere). The graph pipeline simply does not exist for users who don't enable Neo4j. WhenNEO4J_URIIS set, entities and relationships are extracted and written directly to Neo4j — Neo4j is the ONLY place graph data lives. The Postgreskg_*tables are a legacy artifact from the old architecture and are not used by this feature.Data Flow: Query Pipeline
flowchart LR Q[User Query] --> E1[BERT NER on Query<br/>Extract Entities + Embedding] E1 --> ES{Ensemble Search} ES --> BM25[BM25 Retriever<br/>Postgres] ES --> Vec[Vector Retriever<br/>pgvector] ES --> GR["Graph Retriever<br/>Neo4j (when enabled)"] BM25 --> RRF[Reciprocal Rank Fusion] Vec --> RRF GR --> RRF RRF --> Rerank[Sidecar Cross-Encoder Rerank] Rerank --> Answer[LLM Answer Generation] style GR fill:#e1f5fe,stroke:#0288d1Feature Packet Dependency Graph
graph TD R1[Req 1: Neo4j Docker Infrastructure] --> R5 R2[Req 2: Embedding Provider Abstraction] --> R6 R3[Req 3: BERT Entity CLS Embeddings] --> R5 R4[Req 4: Relationship Extraction Endpoint] --> R5 R5[Req 5: Direct Neo4j Write Pipeline] --> R6 R5 --> R7 R5 --> R9 R6[Req 6: Vector Index + Entity Resolution] --> R7 R6 --> R10 R7[Req 7: Enhanced Graph Retriever] --> R8 R7 --> R10 R8[Req 8: Modular Ensemble Search] R9[Req 9: Document Content Graph] --> R10 R10[Req 10: Graph-Guided Hybrid Retrieval] --> R11 R10 --> R12 R11[Req 11: Community Detection] R12[Req 12: Text2Cypher] R13[Req 13: Postgres Independence<br/>Cross-Cutting] -.-> R5 R13 -.-> R8 R14[Req 14: Graceful Degradation<br/>Cross-Cutting] -.-> R5 R14 -.-> R7 R14 -.-> R8 style R1 fill:#c8e6c9 style R2 fill:#c8e6c9 style R3 fill:#c8e6c9 style R4 fill:#c8e6c9 style R13 fill:#ffecb3 style R14 fill:#ffecb3 classDef foundation fill:#c8e6c9,stroke:#388e3c classDef crosscut fill:#ffecb3,stroke:#f57c00Critical path: R1 → R5 → R6 → R7 → R10 → R11/R12
Parallelizable foundations (no dependencies): R1, R2, R3, R4 can all be built simultaneously.
Components and Interfaces
1. Sidecar API Contracts
POST /extract-entities(Enhanced)Existing endpoint enhanced with optional CLS embeddings. Backward compatible.
POST /extract-relationships(New)Model-agnostic relationship extraction via any OpenAI-compatible LLM.
2. Embedding Provider Interface
3. Neo4j Direct Writer Interface
Data Models
Neo4j Graph Schema
Shared Zod Schemas (Parent-Level Contracts)
These are the runtime-validated contracts that ALL sub-feature specs must conform to. Updated from the old
gemma-schemas.tswith model-agnostic naming and new node types.Environment Variable Contract
All Neo4j-related environment variables across all requirements, with validation rules:
Ingestion Pipeline Contract
Step ordering with inputs/outputs and gating logic:
Ensemble Search Contract
How the graph retriever plugs into the existing ensemble:
Correctness Properties
A property is a characteristic or behavior that should hold true across all valid executions of a system — essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.
Property 1: Embedding output normalization
For any valid text input and any configured embedding backend (BERT or OpenAI-compatible), the
EmbeddingProvider.embed()method SHALL return an array of floats with length equal toprovider.dimensions, andEmbeddingProvider.embedBatch()SHALL return arrays all of the same dimensionality.Validates: Requirements 2.1, 2.4
Property 2: BERT CLS embedding dimensionality
For any text chunk that produces at least one entity, when
/extract-entities?include_embeddings=trueis called, every entity in the response SHALL have anembeddingfield that is an array of exactly 768 floating-point numbers.Validates: Requirements 3.1, 3.3
Property 3: Relationship extraction structural validity
For any extraction response from
POST /extract-relationships, every relationship in therelationshipsarray SHALL havesourceandtargetvalues that each match thenameof an entity in the same chunk'sentitiesarray, AND every relationshiptypeSHALL match the regex^[A-Z][A-Z0-9_]*$.Validates: Requirements 4.2, 4.3
Property 4: Think-tag stripping preserves JSON content
For any string containing
<think>...</think>tags wrapping arbitrary text, followed by valid JSON content, the think-tag stripping function SHALL produce a string that is valid JSON-parseable.Validates: Requirement 4.7
Property 5: Neo4j entity write completeness
For any entity written to Neo4j via the direct writer, the resulting Entity node SHALL contain all required properties (name, displayName, label, confidence, mentionCount, companyId), and when the input entity has a non-null embedding, the node SHALL have an embedding property of length 768. For any entity mention, there SHALL exist a MENTIONED_IN edge from the Entity node to the corresponding Section node.
Validates: Requirements 5.3, 5.5, 6.1
Property 6: Dynamic Cypher relationship types
For any relationship written to Neo4j, the Cypher relationship type SHALL be the actual relationship type string (e.g., CEO_OF, ACQUIRED, CO_OCCURS) and SHALL NOT be a generic RELATES_TO type with a type property.
Validates: Requirement 5.4
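Writing the type as a real Cypher relationship type usually means placing it in the query text rather than passing it as an ordinary parameter, so it should be validated first. The sketch below assumes that approach; `buildRelationshipMerge` and the `$source`, `$target`, and `$companyId` parameter names are hypothetical, and the query shape is illustrative rather than the writer's actual Cypher.

```typescript
// Same pattern Property 3 requires of extracted relationship types.
const SAFE_RELATIONSHIP_TYPE = /^[A-Z][A-Z0-9_]*$/;

// Builds a MERGE statement whose relationship type is the extracted type itself.
// Node values still travel as query parameters; only the validated type is interpolated.
function buildRelationshipMerge(relationshipType: string): string {
  if (!SAFE_RELATIONSHIP_TYPE.test(relationshipType)) {
    throw new Error(`Unsafe relationship type: ${relationshipType}`);
  }
  return [
    "MATCH (a:Entity {name: $source, companyId: $companyId})",
    "MATCH (b:Entity {name: $target, companyId: $companyId})",
    `MERGE (a)-[r:${relationshipType}]->(b)`,
  ].join("\n");
}
```

Because the regex only admits uppercase letters, digits, and underscores, a type string cannot close the relationship pattern and append extra clauses.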
Property 7: Neo4j write idempotence
For any set of entities and relationships, writing them to Neo4j twice via the direct writer SHALL produce the same graph state (same node count, same edge count, same property values) as writing them once.
Validates: Requirement 5.7
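The comparison Property 7 describes (same node count, edge count, and property values) can be illustrated with an in-memory stand-in for the direct writer. Everything here is hypothetical: `InMemoryGraph` only models keyed upserts and is not how the real writer talks to Neo4j.

```typescript
interface EntityInput {
  name: string;
  companyId: string;
  label: string;
}
interface RelationshipInput {
  source: string;
  target: string;
  type: string;
  companyId: string;
}

// Models upsert-by-key semantics: writing the same item again overwrites rather than duplicates.
class InMemoryGraph {
  private readonly nodes = new Map<string, EntityInput>();
  private readonly edges = new Map<string, RelationshipInput>();

  write(entities: EntityInput[], relationships: RelationshipInput[]): void {
    for (const e of entities) {
      this.nodes.set(`${e.companyId}|${e.name}`, { ...e });
    }
    for (const r of relationships) {
      this.edges.set(`${r.companyId}|${r.source}|${r.type}|${r.target}`, { ...r });
    }
  }

  // Deterministic serialization of the whole graph state, used to compare two graphs.
  snapshot(): string {
    const byKey = (a: [string, unknown], b: [string, unknown]) =>
      a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0;
    return JSON.stringify({
      nodes: Array.from(this.nodes.entries()).sort(byKey),
      edges: Array.from(this.edges.entries()).sort(byKey),
    });
  }
}

const sampleEntities: EntityInput[] = [
  { name: "alice", companyId: "c1", label: "PERSON" },
  { name: "acme", companyId: "c1", label: "ORG" },
];
const sampleRelationships: RelationshipInput[] = [
  { source: "alice", target: "acme", type: "CEO_OF", companyId: "c1" },
];

const writtenOnce = new InMemoryGraph();
writtenOnce.write(sampleEntities, sampleRelationships);

const writtenTwice = new InMemoryGraph();
writtenTwice.write(sampleEntities, sampleRelationships);
writtenTwice.write(sampleEntities, sampleRelationships);
```

A test against a real Neo4j instance would follow the same pattern, replacing `snapshot()` with queries that read back node counts, edge counts, and properties.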
Property 8: Entity resolution merges duplicates
For any two Entity nodes within the same companyId whose embedding cosine similarity exceeds the configured threshold (default 0.85), the entity resolution module SHALL merge them into a single canonical Entity node and create an ALIAS_OF relationship from the alias to the canonical entity.
Validates: Requirements 6.3, 6.4
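The merge condition in Property 8 reduces to a cosine-similarity comparison scoped by companyId. The sketch below assumes plain number arrays for embeddings; `ResolvableEntity` and `shouldMerge` are hypothetical names, and choosing which node becomes canonical is left out.

```typescript
interface ResolvableEntity {
  name: string;
  companyId: string;
  embedding: number[];
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for a zero-length vector.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Two entities are merge candidates only within the same company and strictly above the threshold.
function shouldMerge(a: ResolvableEntity, b: ResolvableEntity, threshold = 0.85): boolean {
  return a.companyId === b.companyId && cosineSimilarity(a.embedding, b.embedding) > threshold;
}
```

The comparison is strict (`>`) because the property says the similarity must exceed the threshold.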
Property 9: Graph retriever traverses all relationship types
For any Neo4j graph containing entities connected by multiple relationship types (e.g., CEO_OF, ACQUIRED, CO_OCCURS), the graph retriever SHALL traverse all relationship types when finding connected entities, not just CO_OCCURS.
Validates: Requirement 7.1
Property 10: Dynamic ensemble weight adjustment
For any graph retriever result with entityMatchCount >= 3 and avgConfidence > 0.7, the ensemble SHALL assign a graph weight of at least 0.3. For any result with entityMatchCount == 0, the graph weight SHALL be 0 (effectively 2-retriever mode).
Validates: Requirement 8.4
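A weight function consistent with Property 10 might look like the following sketch. The two boundary cases come from the property; the intermediate weight of 0.15 and the even split of the remaining weight between BM25 and vector are assumptions made only for illustration, as are the type and function names.

```typescript
interface GraphRetrieverSignal {
  entityMatchCount: number;
  avgConfidence: number;
}
interface EnsembleWeights {
  bm25: number;
  vector: number;
  graph: number;
}

function computeEnsembleWeights(signal: GraphRetrieverSignal): EnsembleWeights {
  let graph: number;
  if (signal.entityMatchCount === 0) {
    graph = 0; // Property 10: no entity matches means 2-retriever mode.
  } else if (signal.entityMatchCount >= 3 && signal.avgConfidence > 0.7) {
    graph = 0.3; // Property 10: strong graph signal gets at least 0.3.
  } else {
    graph = 0.15; // Assumption: the property does not specify the in-between case.
  }
  // Assumption: the remaining weight is shared equally so all weights sum to 1.
  const remaining = (1 - graph) / 2;
  return { bm25: remaining, vector: remaining, graph };
}
```

With zero entity matches the result is `{ bm25: 0.5, vector: 0.5, graph: 0 }`, which matches the existing two-retriever ensemble under the equal-split assumption.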
Property 11: Document content graph structure
For any document ingested with Neo4j enabled, there SHALL exist a Document node with the correct id, name, companyId, and uploadedAt properties, and for each section belonging to that document, there SHALL exist a CONTAINS relationship from the Document node to the Section node.
Validates: Requirements 9.1, 9.2
Property 12: Cross-document and topic linking
For any two documents within the same companyId that share 3 or more entities, there SHALL exist a REFERENCES relationship between their Document nodes with a sharedEntityCount property. For any two Topic nodes within the same companyId with embedding cosine similarity above 0.8, there SHALL exist a RELATED_TO relationship.
Validates: Requirements 9.4, 9.5
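The document-linking half of Property 12 amounts to counting shared entities per document pair. The sketch below assumes the caller has already grouped entity names by document within a single companyId; `findReferenceEdges` and `ReferenceEdge` are hypothetical names, and a real implementation would more likely run this as a Cypher query.

```typescript
interface ReferenceEdge {
  from: string;
  to: string;
  sharedEntityCount: number;
}

// Compares every unordered pair of documents and emits an edge when they share enough entities.
function findReferenceEdges(
  entitiesByDocument: Map<string, Set<string>>,
  minShared = 3,
): ReferenceEdge[] {
  const docs = Array.from(entitiesByDocument.entries());
  const edges: ReferenceEdge[] = [];
  for (let i = 0; i < docs.length; i++) {
    for (let j = i + 1; j < docs.length; j++) {
      const [fromId, fromEntities] = docs[i];
      const [toId, toEntities] = docs[j];
      const shared = Array.from(fromEntities).filter((name) => toEntities.has(name)).length;
      if (shared >= minShared) {
        edges.push({ from: fromId, to: toId, sharedEntityCount: shared });
      }
    }
  }
  return edges;
}
```

The pairwise loop is quadratic in the number of documents, which is acceptable for a sketch but is one reason to push this work into the graph database for large corpora.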
Property 13: Combined graph+vector scoring
For any retrieval result from the graph-guided hybrid retriever, the result score SHALL incorporate both graph proximity (inversely proportional to hop distance) and vector similarity (cosine similarity to query embedding).
Validates: Requirement 10.3
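Property 13 fixes the two ingredients of the score but not how they are combined. The sketch below assumes a weighted sum, with graph proximity modelled as 1 / (1 + hops) so that a zero-hop match stays finite; the formula, the 0.5 default weight, and all names are assumptions for illustration.

```typescript
interface HybridCandidate {
  hopDistance: number; // Number of graph hops from a query-matched entity.
  vectorSimilarity: number; // Cosine similarity to the query embedding.
}

// Proximity shrinks as hop distance grows: 0 hops -> 1, 1 hop -> 0.5, 3 hops -> 0.25.
function graphProximity(hopDistance: number): number {
  if (!Number.isInteger(hopDistance) || hopDistance < 0) {
    throw new Error(`hopDistance must be a non-negative integer, got ${hopDistance}`);
  }
  return 1 / (1 + hopDistance);
}

// Weighted sum of the two signals; graphWeight = 0.5 is an illustrative default.
function combinedScore(candidate: HybridCandidate, graphWeight = 0.5): number {
  return (
    graphWeight * graphProximity(candidate.hopDistance) +
    (1 - graphWeight) * candidate.vectorSimilarity
  );
}
```

Under these assumptions a candidate one hop away with similarity 0.8 scores 0.5 × 0.5 + 0.5 × 0.8 = 0.65, and the same candidate three hops away scores lower.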
Property 14: Community node data shape
For any detected community, there SHALL exist a Community node in Neo4j with id, summary (non-empty string), companyId, and embedding (768-dim float array) properties.
Validates: Requirement 11.3
Property 15: Text2Cypher read-only validation
For any Cypher query generated by the Text2Cypher module, the read-only validator SHALL reject queries containing CREATE, DELETE, SET, MERGE, or DETACH keywords (case-insensitive), preventing accidental graph mutations.
Validates: Requirement 12.4
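A keyword check matching Property 15 can be sketched with a single case-insensitive regex. The function name `isReadOnlyCypher` is hypothetical, and the keyword list is exactly the one in the property; the check is deliberately conservative and will also reject a read query that mentions one of these words inside a string literal.

```typescript
// Word boundaries prevent false positives on identifiers such as "offset" or "created_at".
const WRITE_KEYWORDS = /\b(CREATE|DELETE|SET|MERGE|DETACH)\b/i;

// Returns true only when none of the write keywords appear as standalone words.
function isReadOnlyCypher(query: string): boolean {
  return !WRITE_KEYWORDS.test(query);
}
```

A keyword filter of this kind is a first line of defence; pairing it with a read-only database session or user would also catch mutations that do not use these keywords.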
Property 16: Postgres search independence
For any search query executed against the same dataset, the BM25+Vector retrieval results (excluding the graph retriever contribution) SHALL be identical whether Neo4j is enabled or disabled — Neo4j is purely additive and never modifies Postgres search behavior.
Validates: Requirement 13.1
Error Handling
Graceful Degradation Cascade
Every Neo4j feature follows a strict degradation hierarchy. No optional component failure should ever cause a user-visible error or block document ingestion.
Error Logging Convention
All Neo4j-related errors use structured log prefixes for easy filtering:
Testing Strategy
Dual Testing Approach
Parent Contract Test Strategy
These test files validate parent-level contracts. They are runnable immediately with no Neo4j, no Sidecar, no LLM — just Zod schema validation.
contract.extract-entities.test.ts: ExtractEntitiesResponseSchema, ExtractEntitiesEnhancedResponseSchema
contract.extract-relationships.test.ts: ExtractRelationshipsResponseSchema, ExtractionEntitySchema, ExtractionRelationshipSchema
contract.neo4j-write-result.test.ts: Neo4jWriteResultSchema
contract.neo4j-node-shapes.test.ts: Neo4jEntityNodeSchema, Neo4jSectionNodeSchema, Neo4jDocumentNodeSchema, Neo4jTopicNodeSchema, Neo4jCommunityNodeSchema
contract.embedding-provider.test.ts: EmbeddingResultSchema, EmbeddingBatchResultSchema
contract.env-vars.test.ts: neo4jGraphRAGEnvSchema
contract.ensemble-search.test.ts: GraphRetrieverResultSchema
Agent hook: On sub-feature spec completion, the agent runs pnpm test -- __tests__/api/graphrag/contracts/ to validate all parent contracts still pass.
Property-Based Test Configuration
fast-check (already in devDependencies)
Feature: neo4j-graphrag, Property {N}: {title}
*.pbt.test.ts file
Cross-Cutting Concern Tests
Postgres Independence (Req 13):
Run the same search twice (once with NEO4J_URI set, once without), compare BM25+Vector results
Graceful Degradation (Req 14):
Model-Agnostic Behavior (Reqs 2, 4):
EXTRACTION_LLM_BASE_URL values, verify the same code path is used