# Design Document — Neo4j GraphRAG (Parent Vision Spec)

## Overview

This parent design defines the high-level architecture and shared contracts for the Neo4j GraphRAG system, an opt-in graph+vector search engine that enriches the existing Postgres-based semantic search pipeline. Neo4j operates as an independent feature set. When `NEO4J_URI` is set, the entire graph pipeline activates (entity extraction, relationship extraction, direct Neo4j writes). When it is not set, none of these steps run and no graph data is stored anywhere, including Postgres. Neo4j is the only place graph data lives. The existing BM25+Vector ensemble search continues unchanged in either case.

The system is model-agnostic: BERT (default, free, local) handles entity extraction and embeddings; any OpenAI-compatible LLM (Ollama, LM Studio, OpenAI, Azure) handles relationship extraction. Every component degrades gracefully when its dependencies are unavailable.

This design establishes a two-tier contract system:
1. **Parent contracts** — shared Zod schemas, API interfaces, Neo4j node/relationship shapes, env var conventions that ALL sub-feature specs must conform to
2. **Sub-feature contracts** — defined later in each sub-spec for workstream-level boundaries

---

## Architecture

### System Diagram

```mermaid
graph TB
 subgraph "Client"
 UI[Web App]
 end

 subgraph "Next.js App"
 API[API Routes]
 Ingestion[Ingestion Pipeline Steps A-G+]
 Ensemble[Ensemble Search BM25 + Vector + Graph]
 Reranker[Reranker Client]
 end

 subgraph "Data Stores"
 PG[(PostgreSQL + pgvector Source of Truth)]
 Neo4j[(Neo4j CE Graph + Vector Optional)]
 end

 subgraph "Sidecar (FastAPI)"
 Embed["/embed BERT Embeddings"]
 NER["/extract-entities BERT NER + CLS Embeddings"]
 RelEx["/extract-relationships Model-Agnostic LLM"]
 RerankModel["/rerank Cross-Encoder"]
 end

 subgraph "External (Optional)"
 Ollama[Ollama / LM Studio / OpenAI Any OpenAI-Compatible LLM]
 EmbAPI[External Embedding API Optional]
 end

 UI --> API
 API --> Ingestion
 API --> Ensemble

 Ingestion -->|Steps A-E: always runs| PG
 Ingestion -->|"Step F: BERT NER (only when NEO4J_URI set)"| NER
 Ingestion -->|"Step F2: Relationship Extraction (only when NEO4J_URI set)"| RelEx
 Ingestion -->|"Step G: Direct Write (only when NEO4J_URI set)"| Neo4j

 Ensemble -->|BM25 + Vector| PG
 Ensemble -->|"Graph Retriever (only when NEO4J_URI set)"| Neo4j
 Ensemble --> Reranker --> RerankModel

 RelEx -->|OpenAI-compatible API| Ollama
 NER -->|Local BERT model| NER

 Neo4j -.->|Section content lookup| PG

 style Neo4j fill:#e1f5fe,stroke:#0288d1
 style PG fill:#e8f5e9,stroke:#388e3c
 style Ollama fill:#fff3e0,stroke:#f57c00
```

### Data Flow: Ingestion Pipeline

```mermaid
flowchart LR
 A[Upload] --> B[OCR/Normalize]
 B --> C[Chunk]
 C --> D[Embed via OpenAI or Sidecar]
 D --> E[Store Sections in Postgres]
 E --> CHECK{"NEO4J_URI set?"}
 CHECK -->|No| DONE[Pipeline Complete Postgres-only, no graph data anywhere]
 CHECK -->|Yes| F["Step F: BERT NER → Entities (+ CLS Embeddings)"]
 F --> F2["Step F2: LLM Relationship Extraction (gated: EXTRACTION_LLM_BASE_URL)"]
 F2 --> G["Step G: Direct Neo4j Write Entities + Relationships + Sections"]

 style G fill:#e1f5fe,stroke:#0288d1
 style F fill:#e1f5fe,stroke:#0288d1
 style F2 fill:#fff3e0,stroke:#f57c00
 style DONE fill:#e8f5e9,stroke:#388e3c
```

**Key architectural decision:** `NEO4J_URI` gates the entire graph feature set. When `NEO4J_URI` is not set, Steps F, F2, and G do not run at all: no entity extraction, no relationship extraction, and no graph data stored anywhere, including the Postgres `kg_*` tables. When `NEO4J_URI` is set, entities and relationships are extracted and written directly to Neo4j, which is the only place graph data lives. The Postgres `kg_*` tables are a legacy artifact of the old architecture and are not used by this feature.
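The gating rule can be sketched as a pure function that decides which graph steps are eligible to run. This is only an illustration under stated assumptions: `planGraphSteps`, `GraphGateInput`, and the two health flags are hypothetical names, and the environment is passed in as a plain record rather than read from `src/env.ts`.

```typescript
// Hypothetical helper: the names and shapes here are illustrative, not part of the spec.
type GraphStep = "F" | "F2" | "G";

interface GraphGateInput {
  env: Record<string, string | undefined>;
  sidecarHealthy: boolean;
  neo4jHealthy: boolean;
}

function planGraphSteps(input: GraphGateInput): GraphStep[] {
  const { env, sidecarHealthy, neo4jHealthy } = input;

  // NEO4J_URI gates everything: without it the pipeline ends at Step E.
  if (!env["NEO4J_URI"]) return [];

  const steps: GraphStep[] = [];
  const sidecarReady = Boolean(env["SIDECAR_URL"]) && sidecarHealthy;

  // Step F: BERT NER needs the sidecar.
  if (sidecarReady) steps.push("F");
  // Step F2: relationship extraction also needs EXTRACTION_LLM_BASE_URL.
  if (sidecarReady && env["EXTRACTION_LLM_BASE_URL"]) steps.push("F2");
  // Step G: direct write needs a healthy Neo4j instance.
  if (neo4jHealthy) steps.push("G");

  return steps;
}
```

With no `NEO4J_URI` the function returns an empty list, which corresponds to the Postgres-only path in the flowchart.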

### Data Flow: Query Pipeline

```mermaid
flowchart LR
 Q[User Query] --> E1[BERT NER on Query Extract Entities + Embedding]
 E1 --> ES{Ensemble Search}
 ES --> BM25[BM25 Retriever Postgres]
 ES --> Vec[Vector Retriever pgvector]
 ES --> GR["Graph Retriever Neo4j (when enabled)"]
 BM25 --> RRF[Reciprocal Rank Fusion]
 Vec --> RRF
 GR --> RRF
 RRF --> Rerank[Sidecar Cross-Encoder Rerank]
 Rerank --> Answer[LLM Answer Generation]

 style GR fill:#e1f5fe,stroke:#0288d1
```

### Feature Packet Dependency Graph

```mermaid
graph TD
 R1[Req 1: Neo4j Docker Infrastructure] --> R5
 R2[Req 2: Embedding Provider Abstraction] --> R6
 R3[Req 3: BERT Entity CLS Embeddings] --> R5
 R4[Req 4: Relationship Extraction Endpoint] --> R5
 R5[Req 5: Direct Neo4j Write Pipeline] --> R6
 R5 --> R7
 R5 --> R9
 R6[Req 6: Vector Index + Entity Resolution] --> R7
 R6 --> R10
 R7[Req 7: Enhanced Graph Retriever] --> R8
 R7 --> R10
 R8[Req 8: Modular Ensemble Search]
 R9[Req 9: Document Content Graph] --> R10
 R10[Req 10: Graph-Guided Hybrid Retrieval] --> R11
 R10 --> R12
 R11[Req 11: Community Detection]
 R12[Req 12: Text2Cypher]

 R13[Req 13: Postgres Independence Cross-Cutting] -.-> R5
 R13 -.-> R8
 R14[Req 14: Graceful Degradation Cross-Cutting] -.-> R5
 R14 -.-> R7
 R14 -.-> R8

 classDef foundation fill:#c8e6c9,stroke:#388e3c
 classDef crosscut fill:#ffecb3,stroke:#f57c00
 class R1,R2,R3,R4 foundation
 class R13,R14 crosscut
```

**Critical path:** R1 → R5 → R6 → R7 → R10 → R11/R12

**Parallelizable foundations (no dependencies):** R1, R2, R3, R4 can all be built simultaneously.

---

## Components and Interfaces

### 1. Sidecar API Contracts

#### `POST /extract-entities` (Enhanced)

Existing endpoint enhanced with optional CLS embeddings. Backward compatible.

```typescript
// Request (unchanged)
interface ExtractEntitiesRequest {
 chunks: string[];
}

// Response when called WITHOUT ?include_embeddings=true (unchanged)
interface ExtractEntitiesResponse {
 results: { text: string; entities: { text: string; label: string; score: number }[] }[];
 total_entities: number;
}

// Response when called WITH ?include_embeddings=true (enhanced)
interface ExtractEntitiesEnhancedResponse {
 results: {
 text: string;
 entities: {
 text: string;
 label: string;
 score: number;
 embedding: number[]; // 768-dim BERT CLS vector
 }[];
 }[];
 total_entities: number;
}
```

#### `POST /extract-relationships` (New)

Model-agnostic relationship extraction via any OpenAI-compatible LLM.

```typescript
interface ExtractRelationshipsRequest {
 chunks: string[];
 known_entities?: string[]; // Optional: constrain relationship targets
}

interface ExtractionEntity {
 name: string;
 type: "PERSON" | "ORGANIZATION" | "LOCATION" | "PRODUCT" | "EVENT" | "OTHER";
}

interface ExtractionRelationship {
 source: string; // Must match an entity name
 target: string; // Must match an entity name
 type: string; // SCREAMING_SNAKE_CASE: ^[A-Z][A-Z0-9_]*$
 detail: string; // Brief evidence description
}

interface ExtractionChunkResult {
 text: string;
 entities: ExtractionEntity[];
 relationships: ExtractionRelationship[];
 dropped_relationships: ExtractionRelationship[]; // Invalid source/target
}

interface ExtractRelationshipsResponse {
 results: ExtractionChunkResult[];
 total_entities: number;
 total_relationships: number;
 total_dropped: number;
}
```
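As a rough sketch of how `relationships` and `dropped_relationships` could be separated, the function below keeps a relationship only when both endpoints name an entity from the same chunk. It assumes exact, case-sensitive name matching and that the `type` pattern has already been enforced by schema validation; `partitionRelationships` and `Rel` are hypothetical names.

```typescript
// Hypothetical sketch: exact string matching on entity names is an assumption.
interface Rel {
  source: string;
  target: string;
  type: string;
  detail: string;
}

function partitionRelationships(
  entityNames: string[],
  relationships: Rel[]
): { kept: Rel[]; dropped: Rel[] } {
  const known = new Set(entityNames);
  const kept: Rel[] = [];
  const dropped: Rel[] = [];

  for (const rel of relationships) {
    // A relationship is valid only if both endpoints are known entities in this chunk.
    const endpointsKnown = known.has(rel.source) && known.has(rel.target);
    (endpointsKnown ? kept : dropped).push(rel);
  }
  return { kept, dropped };
}
```

The length of `dropped` summed over all chunks would then correspond to `total_dropped` in the response.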

### 2. Embedding Provider Interface

```typescript
/** Model-agnostic embedding provider — all sub-features use this interface */
interface EmbeddingProvider {
 /** Embed a single text string */
 embed(text: string): Promise<number[]>;
 /** Embed multiple texts in a batch */
 embedBatch(texts: string[]): Promise<number[][]>;
 /** The dimensionality of output vectors */
 readonly dimensions: number;
 /** The provider name for logging */
 readonly providerName: string;
}

/** Factory: reads env vars, returns the appropriate provider */
function createEmbeddingProvider(): EmbeddingProvider;
// - No EMBEDDING_PROVIDER or EMBEDDING_PROVIDER=bert → BertEmbeddingProvider (768-dim, calls Sidecar /embed)
// - EMBEDDING_PROVIDER=openai-compatible → OpenAICompatibleEmbeddingProvider (configurable dim)
// - Fallback: if external API unreachable → BertEmbeddingProvider with warning log
```
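The selection rules in the factory comments can be sketched without any network calls. The version below only decides which backend to use from the environment; it does not implement `embed()` and does not cover the runtime fallback for an unreachable API. `chooseEmbeddingProvider` and `ProviderChoice` are hypothetical names, and treating a missing `EMBEDDING_API_URL` or invalid `EMBEDDING_DIMENSIONS` as a reason to fall back to BERT is an assumption.

```typescript
// Hypothetical sketch: falling back to BERT on incomplete config is an assumption.
type ProviderChoice =
  | { kind: "bert"; dimensions: 768 }
  | { kind: "openai-compatible"; dimensions: number; apiUrl: string; model: string | undefined };

function chooseEmbeddingProvider(
  env: Record<string, string | undefined>,
  warn: (message: string) => void = () => {}
): ProviderChoice {
  const bert: ProviderChoice = { kind: "bert", dimensions: 768 };

  // Unset or "bert" selects the default local provider.
  if (env["EMBEDDING_PROVIDER"] !== "openai-compatible") return bert;

  const apiUrl = env["EMBEDDING_API_URL"];
  const dimensions = Number(env["EMBEDDING_DIMENSIONS"]);
  if (!apiUrl || !Number.isInteger(dimensions) || dimensions <= 0) {
    warn("openai-compatible embedding config incomplete; using BERT instead");
    return bert;
  }
  return { kind: "openai-compatible", dimensions, apiUrl, model: env["EMBEDDING_MODEL"] };
}
```

A real `createEmbeddingProvider()` would wrap this decision and return an object implementing `EmbeddingProvider`.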

### 3. Neo4j Direct Writer Interface

```typescript
/** Replaces the old neo4j-sync.ts "read from Postgres" pattern */
interface Neo4jDirectWriter {
 /** Write entities directly to Neo4j (idempotent MERGE) */
 writeEntities(entities: Neo4jEntityInput[], companyId: string): Promise<number>;
 /** Write relationships with dynamic Cypher types (idempotent MERGE) */
 writeRelationships(relationships: Neo4jRelationshipInput[], companyId: string): Promise<string[]>;
 /** Write section nodes and MENTIONED_IN edges */
 writeMentions(mentions: Neo4jMentionInput[], companyId: string): Promise<number>;
 /** Write document content graph nodes (Document, Topic, cross-doc links) */
 writeDocumentGraph(doc: Neo4jDocumentGraphInput, companyId: string): Promise<void>;
 /** Ensure vector indexes exist (idempotent) */
 ensureIndexes(): Promise<void>;
}

interface Neo4jEntityInput {
 name: string; // normalized lowercase
 displayName: string; // original casing
 label: string; // PER, ORG, LOC, PRODUCT, EVENT, MISC, OTHER
 confidence: number;
 mentionCount: number;
 companyId: string;
 embedding?: number[]; // 768-dim BERT CLS vector (omit when unavailable)
}

interface Neo4jRelationshipInput {
 sourceName: string;
 sourceLabel: string;
 targetName: string;
 targetLabel: string;
 relationType: string; // SCREAMING_SNAKE_CASE dynamic type
 weight: number;
 evidenceCount: number;
 detail?: string;
 documentId: number;
 companyId: string;
}

interface Neo4jMentionInput {
 entityName: string;
 entityLabel: string;
 sectionId: number;
 documentId: number;
 confidence: number;
 companyId: string;
}

interface Neo4jWriteResult {
 entities: number;
 mentions: number;
 relationships: number;
 dynamicRelTypes: string[];
 durationMs: number;
}
```
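Because a dynamic relationship type has to be interpolated into the query text rather than passed as an ordinary parameter, the writer needs to validate it first. The sketch below builds one idempotent `MERGE` statement per relationship. It is an illustration only: `buildRelationshipMerge` is a hypothetical helper, and merging on the endpoint pair plus type (rather than also on `documentId`) is an assumption.

```typescript
// Hypothetical sketch: the MERGE key (endpoints + type) is an assumption.
interface RelationshipInput {
  sourceName: string;
  sourceLabel: string;
  targetName: string;
  targetLabel: string;
  relationType: string;
  weight: number;
  evidenceCount: number;
  detail?: string;
  documentId: number;
  companyId: string;
}

const SAFE_REL_TYPE = /^[A-Z][A-Z0-9_]*$/;

function buildRelationshipMerge(rel: RelationshipInput): { cypher: string; params: Record<string, unknown> } {
  // Reject anything that is not SCREAMING_SNAKE_CASE before it reaches the query text.
  if (!SAFE_REL_TYPE.test(rel.relationType)) {
    throw new Error("Unsafe relationship type: " + rel.relationType);
  }
  const cypher = [
    "MATCH (s:Entity {name: $sourceName, label: $sourceLabel, companyId: $companyId})",
    "MATCH (t:Entity {name: $targetName, label: $targetLabel, companyId: $companyId})",
    "MERGE (s)-[r:" + rel.relationType + "]->(t)",
    "SET r.weight = $weight, r.evidenceCount = $evidenceCount, r.detail = $detail, r.documentId = $documentId",
  ].join("\n");
  const params = {
    sourceName: rel.sourceName,
    sourceLabel: rel.sourceLabel,
    targetName: rel.targetName,
    targetLabel: rel.targetLabel,
    companyId: rel.companyId,
    weight: rel.weight,
    evidenceCount: rel.evidenceCount,
    detail: rel.detail ?? null,
    documentId: rel.documentId,
  };
  return { cypher, params };
}
```

Running the same statement twice leaves one edge with the same properties, which is the behavior the idempotence requirement asks for.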

---

## Data Models

### Neo4j Graph Schema

```cypher
-- ═══════════════════════════════════════════════════════════════
-- NODE TYPES
-- ═══════════════════════════════════════════════════════════════

-- Entity node (core knowledge graph node)
(:Entity {
 name: String, -- normalized lowercase (MERGE key)
 displayName: String, -- original casing
 label: String, -- PER, ORG, LOC, PRODUCT, EVENT, MISC, OTHER (MERGE key)
 confidence: Float, -- average extraction confidence
 mentionCount: Integer, -- total mentions across all documents
 companyId: String, -- company scope (MERGE key)
 embedding: List<Float>, -- 768-dim BERT CLS vector (nullable)
 communityId: Integer -- Leiden community assignment (nullable, Req 11)
})

-- Section node (lightweight — content stays in Postgres)
(:Section {
 id: Integer, -- matches documentSections.id in Postgres
 documentId: Integer -- matches document.id in Postgres
})

-- Document node (Req 9: content graph)
(:Document {
 id: Integer, -- matches document.id in Postgres
 name: String, -- document name
 companyId: String, -- company scope
 uploadedAt: String -- ISO timestamp
})

-- Topic node (Req 9: content graph)
(:Topic {
 name: String, -- topic name
 companyId: String, -- company scope
 embedding: List<Float> -- 768-dim embedding for similarity
})

-- Community node (Req 11: community detection)
(:Community {
 id: Integer, -- community identifier
 summary: String, -- LLM-generated 2-3 sentence summary
 companyId: String, -- company scope
 embedding: List<Float> -- embedding of the summary text
})

-- ═══════════════════════════════════════════════════════════════
-- RELATIONSHIP TYPES
-- ═══════════════════════════════════════════════════════════════

-- Entity ↔ Section (mentions)
(:Entity)-[:MENTIONED_IN {confidence: Float}]->(:Section)

-- Entity ↔ Entity (dynamic types — NOT generic RELATES_TO)
-- Examples: CEO_OF, ACQUIRED, HEADQUARTERED_IN, COMPETES_WITH, CO_OCCURS
(:Entity)-[:<DYNAMIC_TYPE> {
 weight: Float,
 evidenceCount: Integer,
 detail: String, -- evidence text (nullable)
 documentId: Integer -- source document
}]->(:Entity)

-- Entity resolution (Req 6)
(:Entity)-[:ALIAS_OF]->(:Entity) -- alias → canonical

-- Document content graph (Req 9)
(:Document)-[:CONTAINS]->(:Section)
(:Section)-[:DISCUSSES]->(:Topic)
(:Document)-[:REFERENCES {sharedEntityCount: Integer}]->(:Document)
(:Topic)-[:RELATED_TO]->(:Topic)

-- Community membership (Req 11)
(:Entity)-[:BELONGS_TO]->(:Community)

// ═══════════════════════════════════════════════════════════════
// VECTOR INDEXES
// ═══════════════════════════════════════════════════════════════

// Entity embeddings (768-dim, cosine similarity)
CREATE VECTOR INDEX `entity-embeddings` IF NOT EXISTS
FOR (e:Entity) ON (e.embedding)
OPTIONS {indexConfig: {`vector.dimensions`: 768, `vector.similarity_function`: 'cosine'}};

// Topic embeddings (768-dim, cosine similarity), Req 9
CREATE VECTOR INDEX `topic-embeddings` IF NOT EXISTS
FOR (t:Topic) ON (t.embedding)
OPTIONS {indexConfig: {`vector.dimensions`: 768, `vector.similarity_function`: 'cosine'}};

// Community summary embeddings, Req 11
CREATE VECTOR INDEX `community-embeddings` IF NOT EXISTS
FOR (c:Community) ON (c.embedding)
OPTIONS {indexConfig: {`vector.dimensions`: 768, `vector.similarity_function`: 'cosine'}};
```

### Shared Zod Schemas (Parent-Level Contracts)

These are the runtime-validated contracts that ALL sub-feature specs must conform to. Updated from the old `gemma-schemas.ts` with model-agnostic naming and new node types.

```typescript
import { z } from "zod";

// ═══════════════════════════════════════════════════════════════
// SIDECAR API SCHEMAS
// ═══════════════════════════════════════════════════════════════

// ── BERT Entity (base, without embedding) ────────────────────

export const EntityBaseSchema = z.object({
 text: z.string().min(1),
 label: z.string().min(1),
 score: z.number().min(0).max(1),
});

// ── BERT Entity (with CLS embedding) ────────────────────────

export const EntityWithEmbeddingSchema = EntityBaseSchema.extend({
 embedding: z.array(z.number()).length(768),
});

// ── Extract Entities Response (base, backward compatible) ────

export const ExtractEntitiesResponseSchema = z.object({
 results: z.array(z.object({
 text: z.string(),
 entities: z.array(EntityBaseSchema),
 })),
 total_entities: z.number().int().nonnegative(),
});

// ── Extract Entities Enhanced Response (with embeddings) ─────

export const ExtractEntitiesEnhancedResponseSchema = z.object({
 results: z.array(z.object({
 text: z.string(),
 entities: z.array(EntityWithEmbeddingSchema),
 })),
 total_entities: z.number().int().nonnegative(),
});

// ── Extraction Entity (LLM-extracted, model-agnostic) ────────

export const ExtractionEntitySchema = z.object({
 name: z.string().min(1),
 type: z.enum(["PERSON", "ORGANIZATION", "LOCATION", "PRODUCT", "EVENT", "OTHER"]),
});

// ── Extraction Relationship ──────────────────────────────────

export const ExtractionRelationshipSchema = z.object({
 source: z.string().min(1),
 target: z.string().min(1),
 type: z.string().min(1).regex(/^[A-Z][A-Z0-9_]*$/), // SCREAMING_SNAKE_CASE
 detail: z.string(),
});

// ── Extraction Chunk Result ──────────────────────────────────

export const ExtractionChunkResultSchema = z.object({
 text: z.string(),
 entities: z.array(ExtractionEntitySchema),
 relationships: z.array(ExtractionRelationshipSchema),
 dropped_relationships: z.array(ExtractionRelationshipSchema),
});

// ── Extract Relationships Response ───────────────────────────

export const ExtractRelationshipsResponseSchema = z.object({
 results: z.array(ExtractionChunkResultSchema),
 total_entities: z.number().int().nonnegative(),
 total_relationships: z.number().int().nonnegative(),
 total_dropped: z.number().int().nonnegative(),
});

// ═══════════════════════════════════════════════════════════════
// NEO4J DATA SHAPE SCHEMAS
// ═══════════════════════════════════════════════════════════════

// ── Neo4j Entity Node ────────────────────────────────────────

export const Neo4jEntityNodeSchema = z.object({
 name: z.string().min(1),
 displayName: z.string().min(1),
 label: z.string().min(1),
 confidence: z.number().min(0).max(1),
 mentionCount: z.number().int().positive(),
 companyId: z.string().min(1),
 embedding: z.array(z.number()).length(768).nullable(),
});

// ── Neo4j Section Node ───────────────────────────────────────

export const Neo4jSectionNodeSchema = z.object({
 id: z.number().int().positive(),
 documentId: z.number().int().positive(),
});

// ── Neo4j Document Node (Req 9) ──────────────────────────────

export const Neo4jDocumentNodeSchema = z.object({
 id: z.number().int().positive(),
 name: z.string().min(1),
 companyId: z.string().min(1),
 uploadedAt: z.string().min(1),
});

// ── Neo4j Topic Node (Req 9) ─────────────────────────────────

export const Neo4jTopicNodeSchema = z.object({
 name: z.string().min(1),
 companyId: z.string().min(1),
 embedding: z.array(z.number()).length(768),
});

// ── Neo4j Community Node (Req 11) ────────────────────────────

export const Neo4jCommunityNodeSchema = z.object({
 id: z.number().int().nonnegative(),
 summary: z.string().min(1),
 companyId: z.string().min(1),
 embedding: z.array(z.number()).length(768),
});

// ── Neo4j Relationship Properties ────────────────────────────

export const Neo4jDynamicRelPropertiesSchema = z.object({
 weight: z.number().min(0).max(1),
 evidenceCount: z.number().int().positive(),
 detail: z.string().nullable(),
 documentId: z.number().int().positive(),
});

// ── Neo4j Write Result ───────────────────────────────────────

export const Neo4jWriteResultSchema = z.object({
 entities: z.number().int().nonnegative(),
 mentions: z.number().int().nonnegative(),
 relationships: z.number().int().nonnegative(),
 dynamicRelTypes: z.array(z.string()),
 durationMs: z.number().nonnegative(),
});

// ═══════════════════════════════════════════════════════════════
// EMBEDDING PROVIDER SCHEMA
// ═══════════════════════════════════════════════════════════════

export const EmbeddingResultSchema = z.object({
 embedding: z.array(z.number()).min(1),
 dimensions: z.number().int().positive(),
 providerName: z.string().min(1),
});

export const EmbeddingBatchResultSchema = z.object({
 embeddings: z.array(z.array(z.number()).min(1)),
 dimensions: z.number().int().positive(),
 providerName: z.string().min(1),
});

// ═══════════════════════════════════════════════════════════════
// ENSEMBLE SEARCH SCHEMAS
// ═══════════════════════════════════════════════════════════════

export const GraphRetrieverResultSchema = z.object({
 sectionIds: z.array(z.number().int().positive()),
 entityMatchCount: z.number().int().nonnegative(),
 traversalHops: z.number().int().nonnegative(),
 durationMs: z.number().nonnegative(),
});

// ═══════════════════════════════════════════════════════════════
// TYPE EXPORTS
// ═══════════════════════════════════════════════════════════════

export type EntityBase = z.infer<typeof EntityBaseSchema>;
export type EntityWithEmbedding = z.infer<typeof EntityWithEmbeddingSchema>;
export type ExtractionEntity = z.infer<typeof ExtractionEntitySchema>;
export type ExtractionRelationship = z.infer<typeof ExtractionRelationshipSchema>;
export type ExtractionChunkResult = z.infer<typeof ExtractionChunkResultSchema>;
export type ExtractRelationshipsResponse = z.infer<typeof ExtractRelationshipsResponseSchema>;
export type Neo4jEntityNode = z.infer<typeof Neo4jEntityNodeSchema>;
export type Neo4jDocumentNode = z.infer<typeof Neo4jDocumentNodeSchema>;
export type Neo4jTopicNode = z.infer<typeof Neo4jTopicNodeSchema>;
export type Neo4jCommunityNode = z.infer<typeof Neo4jCommunityNodeSchema>;
export type Neo4jWriteResult = z.infer<typeof Neo4jWriteResultSchema>;
export type EmbeddingResult = z.infer<typeof EmbeddingResultSchema>;
```

### Environment Variable Contract


All Neo4j-related environment variables across all requirements, with validation rules:

```typescript
// ═══════════════════════════════════════════════════════════════
// ENVIRONMENT VARIABLE CONTRACT
// ═══════════════════════════════════════════════════════════════
//
// All variables are OPTIONAL — the system works without any of them.
// Each variable gates a specific capability.
//
// ┌─────────────────────────────┬──────────┬─────────────────────────────────────────────────┐
// │ Variable                    │ Required │ Gates                                           │
// ├─────────────────────────────┼──────────┼─────────────────────────────────────────────────┤
// │ NEO4J_URI                   │ No       │ All Neo4j operations (Req 1,5,6,7,8,9)          │
// │ NEO4J_USERNAME              │ No       │ Neo4j auth (default: "neo4j")                   │
// │ NEO4J_PASSWORD              │ No       │ Neo4j auth (default: "password")                │
// │ EXTRACTION_LLM_BASE_URL     │ No       │ Relationship extraction (Req 4,5)               │
// │ EXTRACTION_LLM_MODEL        │ No       │ LLM model selection (Req 4)                     │
// │ EMBEDDING_PROVIDER          │ No       │ Embedding backend: "bert" | "openai-compatible" │
// │ EMBEDDING_API_URL           │ No       │ External embedding API (Req 2)                  │
// │ EMBEDDING_MODEL             │ No       │ External embedding model name (Req 2)           │
// │ EMBEDDING_DIMENSIONS        │ No       │ External embedding dimensions (Req 2)           │
// │ ENABLE_GRAPH_RETRIEVER      │ No       │ Graph retriever in ensemble (Req 7,8)           │
// │ ENTITY_RESOLUTION_THRESHOLD │ No       │ Cosine similarity threshold (default: 0.85)     │
// │ SIDECAR_URL                 │ No       │ All sidecar operations (Req 3,4)                │
// └─────────────────────────────┴──────────┴─────────────────────────────────────────────────┘
//
// GATING LOGIC:
// - NEO4J_URI unset → skip ENTIRE graph pipeline (Steps F, F2, G, G2, G3, G4)
// No entity extraction, no relationship extraction,
// no graph data stored anywhere. Pipeline ends at Step E.
// - NEO4J_URI set + EXTRACTION_LLM_BASE_URL unset → skip relationship extraction only,
// graph has BERT entities with CO_OCCURS relationships
// - NEO4J_URI set + SIDECAR_URL unset → skip entity extraction, no graph data
// - ENABLE_GRAPH_RETRIEVER=false → ensemble uses only BM25+Vector (2-retriever)
// - EMBEDDING_PROVIDER unset → default to BERT via Sidecar (768-dim, free)
//
// MIGRATION FROM OLD SPEC:
// - GEMMA_BASE_URL → EXTRACTION_LLM_BASE_URL (model-agnostic naming)
// - GEMMA_MODEL → EXTRACTION_LLM_MODEL (model-agnostic naming)
// - Old vars remain in src/env.ts for backward compatibility during transition

// Zod validation schema additions for src/env.ts:
const neo4jGraphRAGEnvSchema = z.object({
 // Neo4j connection (optional — enables graph storage)
 NEO4J_URI: optionalString(),
 NEO4J_USERNAME: optionalString(),
 NEO4J_PASSWORD: optionalString(),

 // Extraction LLM (optional — enables relationship extraction)
 // Model-agnostic: works with Ollama, LM Studio, OpenAI, Azure, any OpenAI-compatible
 EXTRACTION_LLM_BASE_URL: optionalString(),
 EXTRACTION_LLM_MODEL: optionalString(),

 // Embedding provider (optional — defaults to BERT via Sidecar)
 EMBEDDING_PROVIDER: z.enum(["bert", "openai-compatible"]).optional(),
 EMBEDDING_API_URL: optionalString(),
 EMBEDDING_MODEL: optionalString(),
 EMBEDDING_DIMENSIONS: z.coerce.number().int().positive().optional(),

 // Graph retriever toggle
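 ENABLE_GRAPH_RETRIEVER: z.preprocess(
 // Pass undefined through so .optional() applies when the variable is unset
 (val) => (val === undefined ? undefined : val === "true" || val === "1"),
 z.boolean().optional()
 ),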

 // Entity resolution
 ENTITY_RESOLUTION_THRESHOLD: z.coerce.number().min(0).max(1).optional(),

 // Backward compatibility (old spec names — deprecated)
 GEMMA_BASE_URL: optionalString(),
 GEMMA_MODEL: optionalString(),
});
```
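During the transition, code that reads the extraction LLM settings has to consider both the new and the deprecated variable names. The sketch below shows one way to resolve them. `resolveExtractionLlm` is a hypothetical helper, and giving the new `EXTRACTION_LLM_*` names precedence over `GEMMA_*` is an assumption, since the migration notes only say the old names remain available.

```typescript
// Hypothetical sketch: precedence of new names over deprecated ones is an assumption.
interface ExtractionLlmConfig {
  baseUrl: string | undefined;
  model: string | undefined;
  deprecatedVarsUsed: string[];
}

function resolveExtractionLlm(env: Record<string, string | undefined>): ExtractionLlmConfig {
  const deprecatedVarsUsed: string[] = [];

  // Prefer the current variable; fall back to the legacy one and record that it was used.
  const pick = (current: string, legacy: string): string | undefined => {
    const value = env[current];
    if (value) return value;
    const old = env[legacy];
    if (old) {
      deprecatedVarsUsed.push(legacy);
      return old;
    }
    return undefined;
  };

  const baseUrl = pick("EXTRACTION_LLM_BASE_URL", "GEMMA_BASE_URL");
  const model = pick("EXTRACTION_LLM_MODEL", "GEMMA_MODEL");
  return { baseUrl, model, deprecatedVarsUsed };
}
```

The `deprecatedVarsUsed` list gives callers something to log so that remaining uses of the old names are visible.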

### Ingestion Pipeline Contract

Step ordering with inputs/outputs and gating logic:

```
═══════════════════════════════════════════════════════════════
EXISTING STEPS (A-E) — UNCHANGED, NEVER MODIFIED
═══════════════════════════════════════════════════════════════

Step A: Upload
 Input: File (PDF, DOCX, etc.)
 Output: Raw file stored, ocrJob created
 Gate: Always runs

Step B: OCR / Normalize
 Input: Raw file
 Output: PageContent[] (text per page)
 Gate: Always runs

Step C: Chunk
 Input: PageContent[]
 Output: DocumentChunk[] (parent + child chunks)
 Gate: Always runs

Step D: Embed
 Input: DocumentChunk[]
 Output: VectorizedChunk[] (with embeddings; 1536-dim when OpenAI is used)
 Gate: Always runs (uses OpenAI or Sidecar)

Step E: Store Sections
 Input: VectorizedChunk[]
 Output: StoredSection[] {sectionId, content} in Postgres
 Gate: Always runs

═══════════════════════════════════════════════════════════════
NEO4J GRAPH PIPELINE — ALL GATED BY NEO4J_URI
═══════════════════════════════════════════════════════════════

When NEO4J_URI is NOT set, NONE of these steps run. No entity
extraction, no relationship extraction, no graph data stored
anywhere. The pipeline ends at Step E.

Step F: BERT NER Entity Extraction
 Input: StoredSection[] (from Step E)
 Output: Entities + CO_OCCURS relationships + CLS embeddings
 Gate: NEO4J_URI is set AND SIDECAR_URL is set AND sidecar is healthy
 Target: Passed directly to Step G (Neo4j) — NOT stored in Postgres

Step F2: LLM Relationship Extraction (NEW)
 Input: StoredSection[] (same chunks as Step F)
 Output: Typed relationships (CEO_OF, ACQUIRED, etc.)
 + additional entities from LLM
 Gate: NEO4J_URI is set AND EXTRACTION_LLM_BASE_URL is set AND sidecar is healthy
 Target: Passed directly to Step G (Neo4j) — NOT stored in Postgres
 Note: Independent of Step F — if BERT fails, LLM still runs

Step G: Neo4j Direct Write
 Input: All entities + relationships from Steps F and F2
 (passed directly in memory, NOT read from Postgres)
 Output: Neo4jWriteResult
 Gate: NEO4J_URI is set AND Neo4j is healthy
 Target: Neo4j graph store (direct MERGE queries)
 Note: Neo4j is the ONLY place graph data lives.
 Postgres kg_* tables are NOT used by this pipeline.

Step G2: Document Content Graph (NEW, Req 9)
 Input: Document metadata + Section IDs + extracted topics
 Output: Document, Topic nodes + CONTAINS, DISCUSSES, REFERENCES edges
 Gate: NEO4J_URI is set AND Neo4j is healthy
 Target: Neo4j graph store

Step G3: Entity Resolution (NEW, Req 6)
 Input: Newly written entities with embeddings
 Output: Merged entities + ALIAS_OF relationships
 Gate: NEO4J_URI is set AND entity-embeddings vector index exists
 Target: Neo4j graph store

Step G4: Community Detection (NEW, Req 11)
 Input: Entity graph in Neo4j
 Output: Community assignments + Community summary nodes
 Gate: NEO4J_URI is set AND GDS plugin available
 Target: Neo4j graph store

═══════════════════════════════════════════════════════════════
FAILURE ISOLATION
═══════════════════════════════════════════════════════════════

- Step F failure → Step F2 still attempts to run, Step G writes whatever data is available
- Step F2 failure → Step G still writes BERT-only data from Step F
- Step G failure → Graph data is lost for this document (nowhere else to store it)
 Document is still marked as successfully ingested (Postgres Steps A-E are source of truth)
- Step G2 failure → Entity graph is still valid, just no content graph
- Step G3 failure → Entities exist but may have duplicates
- Step G4 failure → Graph works, just no community summaries
```
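The failure-isolation rules for Steps F, F2, and G can be sketched as an orchestration function in which each step is wrapped so that its failure is recorded instead of propagated. This is a simplified illustration: `runGraphSteps`, `settle`, and the `GraphData` shape are hypothetical, and entities and relationships are reduced to strings to keep the example short.

```typescript
// Hypothetical sketch: step signatures and GraphData are simplified for illustration.
interface GraphData {
  entities: string[];
  relationships: string[];
}

// Run one step; on failure record its name and continue with a fallback value.
async function settle<T>(name: string, step: () => Promise<T>, fallback: T, failures: string[]): Promise<T> {
  try {
    return await step();
  } catch (err) {
    failures.push(name);
    return fallback;
  }
}

async function runGraphSteps(
  stepF: () => Promise<GraphData>,
  stepF2: () => Promise<GraphData>,
  stepG: (data: GraphData) => Promise<void>
): Promise<{ ingested: true; failures: string[] }> {
  const failures: string[] = [];
  const empty: GraphData = { entities: [], relationships: [] };

  // F and F2 are independent: a failure in one does not stop the other.
  const bert = await settle("F", stepF, empty, failures);
  const llm = await settle("F2", stepF2, empty, failures);

  // G writes whatever data is available, passed in memory.
  const merged: GraphData = {
    entities: bert.entities.concat(llm.entities),
    relationships: bert.relationships.concat(llm.relationships),
  };
  await settle("G", () => stepG(merged), undefined, failures);

  // Postgres Steps A-E already succeeded, so the document still counts as ingested.
  return { ingested: true, failures };
}
```

Steps G2 through G4 could be added to the same chain with the same wrapper, since each is described as optional relative to the entity graph.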

### Ensemble Search Contract

How the graph retriever plugs into the existing ensemble:

```typescript
// ═══════════════════════════════════════════════════════════════
// ENSEMBLE SEARCH INTEGRATION
// ═══════════════════════════════════════════════════════════════

// Weight configurations
const WEIGHTS_2_RETRIEVER = [0.4, 0.6]; // BM25, Vector (current default)
const WEIGHTS_3_RETRIEVER = [0.3, 0.5, 0.2]; // BM25, Vector, Graph (static default)

// Dynamic weight adjustment (Req 8.4)
// When graph retriever returns results, adjust weights based on match quality:
// entityMatchCount >= 3 and avgConfidence > 0.7 → graph weight increases to 0.3-0.5
// entityMatchCount >= 1 otherwise → graph weight stays at 0.1-0.2
// entityMatchCount 0 → graph weight drops to 0, effectively 2-retriever mode

interface GraphRetrieverMetrics {
 entityMatchCount: number;
 avgConfidence: number;
 traversalHops: number;
}

function computeDynamicWeights(metrics: GraphRetrieverMetrics): [number, number, number] {
 if (metrics.entityMatchCount >= 3 && metrics.avgConfidence > 0.7) {
 return [0.15, 0.35, 0.50]; // Strong graph signal
 }
 if (metrics.entityMatchCount >= 1) {
 return [0.30, 0.50, 0.20]; // Moderate graph signal
 }
 return [0.40, 0.60, 0.00]; // No graph signal → 2-retriever fallback
}

// Graceful degradation interface
// The ensemble MUST handle these failure modes without user-visible errors:
// 1. Neo4j unreachable → use 2-retriever weights, log warning
// 2. Graph retriever timeout (>2s) → use 2-retriever weights, log warning
// 3. Graph retriever returns empty → use 2-retriever weights (normal behavior)
// 4. Sidecar unreachable → skip reranking, return RRF-fused results
```
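To show how these weights might enter the fusion step, here is a minimal weighted Reciprocal Rank Fusion sketch over ranked lists of section IDs. `weightedRrf` is a hypothetical function, and the smoothing constant `k = 60` is a commonly used default for RRF rather than a value taken from this design.

```typescript
// Hypothetical sketch: k = 60 is a conventional RRF constant, not specified by this design.
function weightedRrf(rankings: number[][], weights: number[], k: number = 60): number[] {
  const scores = new Map<number, number>();

  rankings.forEach((ranking, retrieverIndex) => {
    const weight = weights[retrieverIndex] ?? 0;
    ranking.forEach((sectionId, position) => {
      // position is 0-based, so the 1-based rank is position + 1.
      const contribution = weight / (k + position + 1);
      scores.set(sectionId, (scores.get(sectionId) ?? 0) + contribution);
    });
  });

  const fused: { sectionId: number; score: number }[] = [];
  scores.forEach((score, sectionId) => {
    fused.push({ sectionId, score });
  });
  // Highest fused score first.
  fused.sort((a, b) => b.score - a.score);
  return fused.map((entry) => entry.sectionId);
}
```

Passing an empty graph ranking together with the `[0.40, 0.60, 0.00]` weights reduces this to the 2-retriever behavior, so the degradation cases listed above need no separate code path in the fusion itself.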

---

## Correctness Properties

*A property is a characteristic or behavior that should hold true across all valid executions of a system — essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.*

### Property 1: Embedding output normalization

*For any* valid text input and any configured embedding backend (BERT or OpenAI-compatible), the `EmbeddingProvider.embed()` method SHALL return an array of floats with length equal to `provider.dimensions`, and `EmbeddingProvider.embedBatch()` SHALL return arrays all of the same dimensionality.

**Validates: Requirements 2.1, 2.4**

### Property 2: BERT CLS embedding dimensionality

*For any* text chunk that produces at least one entity, when `/extract-entities?include_embeddings=true` is called, every entity in the response SHALL have an `embedding` field that is an array of exactly 768 floating-point numbers.

**Validates: Requirements 3.1, 3.3**

### Property 3: Relationship extraction structural validity

*For any* extraction response from `POST /extract-relationships`, every relationship in the `relationships` array SHALL have `source` and `target` values that each match the `name` of an entity in the same chunk's `entities` array, AND every relationship `type` SHALL match the regex `^[A-Z][A-Z0-9_]*$`.

**Validates: Requirements 4.2, 4.3**

### Property 4: Think-tag stripping preserves JSON content

*For any* string containing `<think>...</think>` tags wrapping arbitrary text, followed by valid JSON content, the think-tag stripping function SHALL produce a string that is valid JSON-parseable.

**Validates: Requirement 4.7**
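A minimal sketch of such a stripping function, assuming well-formed, non-nested `<think>...</think>` blocks, is shown below. `stripThinkTags` is a hypothetical name; unclosed tags are deliberately left untouched in this version.

```typescript
// Hypothetical sketch: assumes every <think> has a matching </think> and no nesting.
function stripThinkTags(raw: string): string {
  // [\s\S] matches across newlines; the lazy quantifier stops at the first closing tag.
  return raw.replace(/<think>[\s\S]*?<\/think>/g, "").trim();
}
```

The result can be handed to `JSON.parse` directly, provided the remainder of the model output is the JSON payload.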

### Property 5: Neo4j entity write completeness

*For any* entity written to Neo4j via the direct writer, the resulting Entity node SHALL contain all required properties (`name`, `displayName`, `label`, `confidence`, `mentionCount`, `companyId`), and when the input entity has a non-null embedding, the node SHALL have an `embedding` property of length 768. For any entity mention, there SHALL exist a `MENTIONED_IN` edge from the Entity node to the corresponding Section node.

**Validates: Requirements 5.3, 5.5, 6.1**
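Before writing, raw BERT mentions have to be folded into one input per entity. The sketch below groups mentions by normalized name and label, counts them, and averages their scores. It is illustrative only: `aggregateEntities` is a hypothetical helper, and keeping the first mention's casing and embedding is an assumption.

```typescript
// Hypothetical sketch: keeping the first mention's displayName and embedding is an assumption.
interface BertMention {
  text: string;
  label: string;
  score: number;
  embedding?: number[];
}

interface EntityNodeInput {
  name: string;
  displayName: string;
  label: string;
  confidence: number;
  mentionCount: number;
  companyId: string;
  embedding: number[] | null;
}

function aggregateEntities(mentions: BertMention[], companyId: string): EntityNodeInput[] {
  const byKey = new Map<string, { node: EntityNodeInput; scoreSum: number }>();

  for (const mention of mentions) {
    const name = mention.text.trim().toLowerCase(); // normalized MERGE key
    const key = name + "|" + mention.label;
    const existing = byKey.get(key);
    if (existing) {
      existing.scoreSum += mention.score;
      existing.node.mentionCount += 1;
      existing.node.confidence = existing.scoreSum / existing.node.mentionCount;
    } else {
      byKey.set(key, {
        scoreSum: mention.score,
        node: {
          name,
          displayName: mention.text.trim(),
          label: mention.label,
          confidence: mention.score,
          mentionCount: 1,
          companyId,
          embedding: mention.embedding ?? null,
        },
      });
    }
  }

  const nodes: EntityNodeInput[] = [];
  byKey.forEach((value) => {
    nodes.push(value.node);
  });
  return nodes;
}
```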

### Property 6: Dynamic Cypher relationship types

*For any* relationship written to Neo4j, the Cypher relationship type SHALL be the actual relationship type string (e.g., `CEO_OF`, `ACQUIRED`, `CO_OCCURS`) and SHALL NOT be a generic `RELATES_TO` type with a `type` property.

**Validates: Requirement 5.4**

### Property 7: Neo4j write idempotence

*For any* set of entities and relationships, writing them to Neo4j twice via the direct writer SHALL produce the same graph state (same node count, same edge count, same property values) as writing them once.

**Validates: Requirement 5.7**

### Property 8: Entity resolution merges duplicates

*For any* two Entity nodes within the same `companyId` whose embedding cosine similarity exceeds the configured threshold (default 0.85), the entity resolution module SHALL merge them into a single canonical Entity node and create an `ALIAS_OF` relationship from the alias to the canonical entity.

**Validates: Requirements 6.3, 6.4**
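The merge decision in this property reduces to a cosine-similarity comparison within one company scope. A small sketch follows; `cosineSimilarity` and `shouldMerge` are hypothetical helpers, and a production implementation would more likely query the `entity-embeddings` vector index than compare pairs in application code.

```typescript
// Hypothetical sketch: pairwise comparison in application code, for illustration only.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  // A zero vector has no direction; treat it as dissimilar to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

interface EmbeddedEntity {
  companyId: string;
  embedding: number[];
}

function shouldMerge(a: EmbeddedEntity, b: EmbeddedEntity, threshold: number = 0.85): boolean {
  // Entities from different companies are never merged.
  if (a.companyId !== b.companyId) return false;
  // "Exceeds the threshold" is read as a strict comparison.
  return cosineSimilarity(a.embedding, b.embedding) > threshold;
}
```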

### Property 9: Graph retriever traverses all relationship types

*For any* Neo4j graph containing entities connected by multiple relationship types (e.g., `CEO_OF`, `ACQUIRED`, `CO_OCCURS`), the graph retriever SHALL traverse all relationship types when finding connected entities — not just `CO_OCCURS`.

**Validates: Requirement 7.1**

### Property 10: Dynamic ensemble weight adjustment

*For any* graph retriever result with `entityMatchCount >= 3` and `avgConfidence > 0.7`, the ensemble SHALL assign a graph weight of at least 0.3. For any result with `entityMatchCount == 0`, the graph weight SHALL be 0 (effectively 2-retriever mode).

**Validates: Requirement 8.4**
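
One way to satisfy Property 10 is sketched below. Only the two boundary cases come from the property; the intermediate weight of 0.15 and the even split of the remainder between BM25 and vector are assumptions made for illustration, and `computeEnsembleWeights` is a hypothetical name.

```typescript
interface GraphRetrieverSignal {
  entityMatchCount: number;
  avgConfidence: number;
}

interface EnsembleWeights {
  bm25: number;
  vector: number;
  graph: number;
}

function computeEnsembleWeights(signal: GraphRetrieverSignal): EnsembleWeights {
  let graph: number;
  if (signal.entityMatchCount === 0) {
    graph = 0; // 2-retriever mode, per Property 10
  } else if (signal.entityMatchCount >= 3 && signal.avgConfidence > 0.7) {
    graph = 0.3; // minimum allowed for a strong graph signal, per Property 10
  } else {
    graph = 0.15; // ASSUMPTION: a weak-signal weight is not specified in the spec
  }
  // ASSUMPTION: BM25 and vector share the remaining weight equally.
  const rest = (1 - graph) / 2;
  return { bm25: rest, vector: rest, graph };
}
```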

### Property 11: Document content graph structure

*For any* document ingested with Neo4j enabled, there SHALL exist a Document node with the correct `id`, `name`, `companyId`, and `uploadedAt` properties, and for each section belonging to that document, there SHALL exist a `CONTAINS` relationship from the Document node to the Section node.

**Validates: Requirements 9.1, 9.2**

### Property 12: Cross-document and topic linking

*For any* two documents within the same `companyId` that share 3 or more entities, there SHALL exist a `REFERENCES` relationship between their Document nodes with a `sharedEntityCount` property. *For any* two Topic nodes within the same `companyId` with embedding cosine similarity above 0.8, there SHALL exist a `RELATED_TO` relationship.

**Validates: Requirements 9.4, 9.5**
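
The document-linking rule in Property 12 can be illustrated with an in-memory pass over per-document entity sets. This is a sketch only: `findReferenceEdges` is a hypothetical helper, the caller is assumed to pass documents from a single `companyId`, and a production version would more likely compute the overlap with a Cypher query.

```typescript
interface ReferenceEdge {
  from: string;
  to: string;
  sharedEntityCount: number;
}

// docEntities maps a document id to the set of entity names it mentions.
// Returns one edge per unordered document pair sharing >= minShared entities.
function findReferenceEdges(
  docEntities: Map<string, Set<string>>,
  minShared = 3,
): ReferenceEdge[] {
  const docIds = Array.from(docEntities.keys()).sort();
  const edges: ReferenceEdge[] = [];
  for (let i = 0; i < docIds.length; i++) {
    for (let j = i + 1; j < docIds.length; j++) {
      const a = docEntities.get(docIds[i]) ?? new Set<string>();
      const b = docEntities.get(docIds[j]) ?? new Set<string>();
      let shared = 0;
      a.forEach((name) => {
        if (b.has(name)) shared++;
      });
      if (shared >= minShared) {
        edges.push({ from: docIds[i], to: docIds[j], sharedEntityCount: shared });
      }
    }
  }
  return edges;
}
```

The pairwise loop is quadratic in the number of documents, which is acceptable for a test oracle but not necessarily for large tenants.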

### Property 13: Combined graph+vector scoring

*For any* retrieval result from the graph-guided hybrid retriever, the result score SHALL incorporate both graph proximity (inversely proportional to hop distance) and vector similarity (cosine similarity to query embedding).

**Validates: Requirement 10.3**
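
Property 13 fixes the two ingredients of the score but not how they are combined. The sketch below assumes a simple weighted sum, with graph proximity taken as `1 / hopDistance`; the `graphShare` parameter, its default of 0.5, and the function name `combinedScore` are all assumptions.

```typescript
// hopDistance: number of hops from the matched entity (assumed >= 1).
// vectorSimilarity: cosine similarity between the result and the query embedding.
// graphShare: ASSUMED blend factor in [0, 1]; not specified by the design.
function combinedScore(
  hopDistance: number,
  vectorSimilarity: number,
  graphShare = 0.5,
): number {
  // Inversely proportional to hop distance; clamp so 0 hops does not divide by zero.
  const proximity = 1 / Math.max(hopDistance, 1);
  return graphShare * proximity + (1 - graphShare) * vectorSimilarity;
}
```

With this form, a result two hops away needs a higher vector similarity than a one-hop result to reach the same score, which is the behavior the property describes.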

### Property 14: Community node data shape

*For any* detected community, there SHALL exist a Community node in Neo4j with `id`, `summary` (non-empty string), `companyId`, and `embedding` (768-dim float array) properties.

**Validates: Requirement 11.3**

### Property 15: Text2Cypher read-only validation

*For any* Cypher query generated by the Text2Cypher module, the read-only validator SHALL reject queries containing `CREATE`, `DELETE`, `SET`, `MERGE`, or `DETACH` keywords (case-insensitive), preventing accidental graph mutations.

**Validates: Requirement 12.4**
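
A keyword-based validator matching Property 15 might look like the following sketch. The name `isReadOnlyCypher` is hypothetical, and the keyword list mirrors the property exactly; other mutating clauses such as `REMOVE` are outside what this sketch checks.

```typescript
// Keywords from Property 15, matched as whole words, case-insensitively.
const WRITE_KEYWORDS = /\b(CREATE|DELETE|SET|MERGE|DETACH)\b/i;

// Returns true only when none of the listed write keywords appear.
function isReadOnlyCypher(query: string): boolean {
  return !WRITE_KEYWORDS.test(query);
}
```

The check is deliberately conservative: a read query that merely mentions one of these words inside a string literal is rejected too. Word boundaries keep identifiers such as `dataset` or `OFFSET` from triggering a false match on `SET`.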

### Property 16: Postgres search independence

*For any* search query executed against the same dataset, the BM25+Vector retrieval results (excluding the graph retriever contribution) SHALL be identical whether Neo4j is enabled or disabled — Neo4j is purely additive and never modifies Postgres search behavior.

**Validates: Requirement 13.1**

---

## Error Handling

### Graceful Degradation Cascade

Every Neo4j feature follows a strict degradation hierarchy. No optional component failure should ever cause a user-visible error or block document ingestion.

```
Component Health Check Order (at pipeline start):
 1. Sidecar (/health) → gates Steps F, F2
 2. Neo4j (RETURN 1) → gates Steps G, G2, G3, G4
 3. Extraction LLM (/health or first call) → gates Step F2
 4. GDS plugin (CALL gds.graph.list()) → gates Step G4

Failure Handling:
┌──────────────────────────┬────────────────────────────────────────────┐
│ Component Down │ Behavior │
├──────────────────────────┼────────────────────────────────────────────┤
│ NEO4J_URI not set │ Entire graph pipeline disabled │
│ │ No entity extraction, no relationships │
│ │ No graph data stored anywhere │
│ │ Pipeline ends at Step E (Postgres only) │
├──────────────────────────┼────────────────────────────────────────────┤
│ Sidecar unreachable │ Skip entity extraction (Step F) │
│ (NEO4J_URI is set) │ No graph data for this document │
│ │ Document still ingested (Steps A-E ok) │
├──────────────────────────┼────────────────────────────────────────────┤
│ Neo4j unreachable │ Skip Neo4j writes (Step G) │
│ (NEO4J_URI is set) │ Entity extraction still runs but data │
│ │ is discarded (nowhere to write it) │
│ │ Ensemble search uses 2-retriever mode │
├──────────────────────────┼────────────────────────────────────────────┤
│ Extraction LLM down │ Skip relationship extraction (Step F2) │
│ │ Graph has BERT entities + CO_OCCURS only │
├──────────────────────────┼────────────────────────────────────────────┤
│ GDS plugin unavailable │ Skip community detection (Step G4) │
│ │ All other graph operations work normally │
├──────────────────────────┼────────────────────────────────────────────┤
│ Vector index missing │ Skip entity resolution (Step G3) │
│ │ Entities may have duplicates │
├──────────────────────────┼────────────────────────────────────────────┤
│ Neo4j down at query time │ Graph retriever returns empty [] │
│ │ Ensemble falls back to BM25+Vector │
│ │ No user-visible error │
├──────────────────────────┼────────────────────────────────────────────┤
│ External embedding API │ Fall back to BERT via Sidecar │
│ unreachable │ Log warning, continue with 768-dim │
└──────────────────────────┴────────────────────────────────────────────┘
```
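
The cascade above can be read as a pure function from health-check results to the list of optional steps that should run. The sketch below is one interpretation of the table, not the pipeline's actual code: `HealthStatus`, `planGraphSteps`, and the field names are hypothetical, and it assumes Step F2 still runs when only Neo4j is unreachable, since the health check order gates F2 on the Sidecar and the extraction LLM only.

```typescript
// Hypothetical summary of the health checks performed at pipeline start.
interface HealthStatus {
  neo4jUriSet: boolean;
  sidecarUp: boolean;
  neo4jUp: boolean;
  extractionLlmUp: boolean;
  gdsAvailable: boolean;
  vectorIndexPresent: boolean;
}

type GraphStep = "F" | "F2" | "G" | "G2" | "G3" | "G4";

// Returns the optional graph steps to run; Steps A-E always run regardless.
function planGraphSteps(h: HealthStatus): GraphStep[] {
  // No NEO4J_URI or no Sidecar: no graph data for this document.
  if (!h.neo4jUriSet || !h.sidecarUp) return [];
  const steps: GraphStep[] = ["F"];
  if (h.extractionLlmUp) steps.push("F2");
  // Neo4j unreachable: extraction ran, but there is nowhere to write.
  if (!h.neo4jUp) return steps;
  steps.push("G", "G2");
  if (h.vectorIndexPresent) steps.push("G3"); // entity resolution
  if (h.gdsAvailable) steps.push("G4"); // community detection
  return steps;
}
```

Keeping the decision in one side-effect-free function makes each row of the table directly testable by flipping a single flag.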

### Error Logging Convention

All Neo4j-related errors use structured log prefixes for easy filtering:

```
[Neo4jWriter] — Direct write operations
[Neo4jRetriever] — Query-time graph retrieval
[EntityResolution] — Duplicate entity merging
[ExtractionLLM] — Relationship extraction via LLM
[EmbeddingProvider] — Embedding generation
[CommunityDetect] — Leiden clustering and summaries
[Text2Cypher] — Natural language to Cypher translation
```
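
A small typed helper is one way to keep these prefixes consistent across modules. The sketch below is illustrative: `LogPrefix` and `formatLog` are hypothetical names, and the real code may route through an existing logger instead of building strings.

```typescript
// The prefixes listed above, as a closed union so typos fail at compile time.
type LogPrefix =
  | "Neo4jWriter"
  | "Neo4jRetriever"
  | "EntityResolution"
  | "ExtractionLLM"
  | "EmbeddingProvider"
  | "CommunityDetect"
  | "Text2Cypher";

// Formats a message with its bracketed prefix, e.g. "[Neo4jWriter] write failed".
function formatLog(prefix: LogPrefix, message: string): string {
  return `[${prefix}] ${message}`;
}

// Usage: warn and continue, in line with the degradation cascade.
console.warn(formatLog("Neo4jRetriever", "Neo4j unreachable, returning []"));
```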

---

## Testing Strategy

### Dual Testing Approach

- **Unit tests**: Specific examples, edge cases, error conditions, graceful degradation paths
- **Property tests**: Universal properties across all inputs (fast-check, minimum 100 iterations)
- **Contract tests**: Zod schema validation — runnable today with no infrastructure
- **Integration tests**: Cross-component data flow with mocked boundaries

### Parent Contract Test Strategy

These test files validate parent-level contracts. They are runnable immediately with no Neo4j, no Sidecar, no LLM — just Zod schema validation.

```
__tests__/api/graphrag/contracts/
├── graphrag-schemas.ts # Shared Zod schemas (source of truth)
├── contract.extract-entities.test.ts # BERT entity response shape validation
├── contract.extract-relationships.test.ts # LLM extraction response shape validation
├── contract.neo4j-write-result.test.ts # Neo4j write result shape validation
├── contract.neo4j-node-shapes.test.ts # Entity, Section, Document, Topic, Community node shapes
├── contract.embedding-provider.test.ts # Embedding result shape validation
├── contract.env-vars.test.ts # Environment variable schema validation
└── contract.ensemble-search.test.ts # Graph retriever result shape validation
```

| Test File | What It Validates | Runs Against |
|-----------|-------------------|--------------|
| `contract.extract-entities.test.ts` | BERT entity response (base + enhanced with embeddings) | `ExtractEntitiesResponseSchema`, `ExtractEntitiesEnhancedResponseSchema` |
| `contract.extract-relationships.test.ts` | LLM extraction response, entity/relationship shapes, SCREAMING_SNAKE_CASE, dropped_relationships | `ExtractRelationshipsResponseSchema`, `ExtractionEntitySchema`, `ExtractionRelationshipSchema` |
| `contract.neo4j-write-result.test.ts` | Neo4j write result shape, dynamicRelTypes array | `Neo4jWriteResultSchema` |
| `contract.neo4j-node-shapes.test.ts` | All Neo4j node types: Entity, Section, Document, Topic, Community | `Neo4jEntityNodeSchema`, `Neo4jSectionNodeSchema`, `Neo4jDocumentNodeSchema`, `Neo4jTopicNodeSchema`, `Neo4jCommunityNodeSchema` |
| `contract.embedding-provider.test.ts` | Embedding result shape, batch result shape, dimensionality | `EmbeddingResultSchema`, `EmbeddingBatchResultSchema` |
| `contract.env-vars.test.ts` | All new env vars parse correctly, optional vars accept undefined, backward compat with GEMMA_* vars | `neo4jGraphRAGEnvSchema` |
| `contract.ensemble-search.test.ts` | Graph retriever result shape, dynamic weight computation | `GraphRetrieverResultSchema` |

**Agent hook:** On sub-feature spec completion, the agent runs `pnpm test -- __tests__/api/graphrag/contracts/` to validate all parent contracts still pass.

### Property-Based Test Configuration

- Library: `fast-check` (already in devDependencies)
- Minimum 100 iterations per property
- Tag format: `Feature: neo4j-graphrag, Property {N}: {title}`
- Each correctness property maps to exactly one `*.pbt.test.ts` file
- Property tests are created in sub-feature specs, not in this parent spec

### Cross-Cutting Concern Tests

**Postgres Independence (Req 13):**
- Each sub-feature spec MUST include a test proving that Postgres search results are identical with Neo4j on vs off
- Test pattern: run the same query twice (once with `NEO4J_URI` set, once without), compare BM25+Vector results

**Graceful Degradation (Req 14):**
- Each sub-feature spec MUST include tests for every failure mode in the degradation cascade
- Test pattern: mock the failing component, verify the pipeline continues without errors

**Model-Agnostic Behavior (Reqs 2, 4):**
- Sub-feature specs for Reqs 2 and 4 MUST include tests proving no hardcoded model names or provider URLs
- Test pattern: configure different `EXTRACTION_LLM_BASE_URL` values, verify the same code path is used
