Neo4j Full Implementation #275

@kien-ship-it

Design Document — Neo4j GraphRAG (Parent Vision Spec)

Overview

This parent design defines the high-level architecture and shared contracts for the Neo4j GraphRAG system, an opt-in graph+vector search engine that enriches the existing Postgres-based semantic search pipeline. Neo4j operates as a completely independent feature set. When NEO4J_URI is set, the entire graph pipeline activates (entity extraction, relationship extraction, direct Neo4j writes). When it is not set, none of these steps run and no graph data is stored anywhere, including Postgres. Neo4j is the only place graph data lives. The existing BM25+Vector ensemble search continues unchanged either way.

The system is model-agnostic: BERT (default, free, local) handles entity extraction and embeddings; any OpenAI-compatible LLM (Ollama, LM Studio, OpenAI, Azure) handles relationship extraction. Every component degrades gracefully when its dependencies are unavailable.

This design establishes a two-tier contract system:

  1. Parent contracts — shared Zod schemas, API interfaces, Neo4j node/relationship shapes, env var conventions that ALL sub-feature specs must conform to
  2. Sub-feature contracts — defined later in each sub-spec for workstream-level boundaries

Architecture

System Diagram

graph TB
    subgraph "Client"
        UI[Web App]
    end

    subgraph "Next.js App"
        API[API Routes]
        Ingestion[Ingestion Pipeline<br/>Steps A-G+]
        Ensemble[Ensemble Search<br/>BM25 + Vector + Graph]
        Reranker[Reranker Client]
    end

    subgraph "Data Stores"
        PG[(PostgreSQL + pgvector<br/>Source of Truth)]
        Neo4j[(Neo4j CE<br/>Graph + Vector<br/>Optional)]
    end

    subgraph "Sidecar (FastAPI)"
        Embed["/embed<br/>BERT Embeddings"]
        NER["/extract-entities<br/>BERT NER + CLS Embeddings"]
        RelEx["/extract-relationships<br/>Model-Agnostic LLM"]
        RerankModel["/rerank<br/>Cross-Encoder"]
    end

    subgraph "External (Optional)"
        Ollama[Ollama / LM Studio / OpenAI<br/>Any OpenAI-Compatible LLM]
        EmbAPI[External Embedding API<br/>Optional]
    end

    UI --> API
    API --> Ingestion
    API --> Ensemble

    Ingestion -->|Steps A-E: always runs| PG
    Ingestion -->|"Step F: BERT NER (only when NEO4J_URI set)"| NER
    Ingestion -->|"Step F2: Relationship Extraction (only when NEO4J_URI set)"| RelEx
    Ingestion -->|"Step G: Direct Write (only when NEO4J_URI set)"| Neo4j

    Ensemble -->|BM25 + Vector| PG
    Ensemble -->|"Graph Retriever (only when NEO4J_URI set)"| Neo4j
    Ensemble --> Reranker --> RerankModel

    RelEx -->|OpenAI-compatible API| Ollama
    NER -->|Local BERT model| NER

    Neo4j -.->|Section content lookup| PG

    style Neo4j fill:#e1f5fe,stroke:#0288d1
    style PG fill:#e8f5e9,stroke:#388e3c
    style Ollama fill:#fff3e0,stroke:#f57c00

Data Flow: Ingestion Pipeline

flowchart LR
    A[Upload] --> B[OCR/Normalize]
    B --> C[Chunk]
    C --> D[Embed via OpenAI or Sidecar]
    D --> E[Store Sections in Postgres]
    E --> CHECK{"NEO4J_URI set?"}
    CHECK -->|No| DONE[Pipeline Complete<br/>Postgres-only, no graph data anywhere]
    CHECK -->|Yes| F["Step F: BERT NER → Entities<br/>(+ CLS Embeddings)"]
    F --> F2["Step F2: LLM Relationship Extraction<br/>(gated: EXTRACTION_LLM_BASE_URL)"]
    F2 --> G["Step G: Direct Neo4j Write<br/>Entities + Relationships + Sections"]

    style G fill:#e1f5fe,stroke:#0288d1
    style F fill:#e1f5fe,stroke:#0288d1
    style F2 fill:#fff3e0,stroke:#f57c00
    style DONE fill:#e8f5e9,stroke:#388e3c

Key architectural decision: NEO4J_URI gates the ENTIRE graph feature set. When NEO4J_URI is not set, Steps F, F2, and G do not run at all: no entity extraction, no relationship extraction, and no graph data stored anywhere, including the Postgres kg_* tables. When NEO4J_URI is set, entities and relationships are extracted and written directly to Neo4j, which is the only place graph data lives. The Postgres kg_* tables are a legacy artifact of the old architecture and are not used by this feature.

Data Flow: Query Pipeline

flowchart LR
    Q[User Query] --> E1[BERT NER on Query<br/>Extract Entities + Embedding]
    E1 --> ES{Ensemble Search}
    ES --> BM25[BM25 Retriever<br/>Postgres]
    ES --> Vec[Vector Retriever<br/>pgvector]
    ES --> GR["Graph Retriever<br/>Neo4j (when enabled)"]
    BM25 --> RRF[Reciprocal Rank Fusion]
    Vec --> RRF
    GR --> RRF
    RRF --> Rerank[Sidecar Cross-Encoder Rerank]
    Rerank --> Answer[LLM Answer Generation]

    style GR fill:#e1f5fe,stroke:#0288d1

Feature Packet Dependency Graph

graph TD
    R1[Req 1: Neo4j Docker Infrastructure] --> R5
    R2[Req 2: Embedding Provider Abstraction] --> R6
    R3[Req 3: BERT Entity CLS Embeddings] --> R5
    R4[Req 4: Relationship Extraction Endpoint] --> R5
    R5[Req 5: Direct Neo4j Write Pipeline] --> R6
    R5 --> R7
    R5 --> R9
    R6[Req 6: Vector Index + Entity Resolution] --> R7
    R6 --> R10
    R7[Req 7: Enhanced Graph Retriever] --> R8
    R7 --> R10
    R8[Req 8: Modular Ensemble Search]
    R9[Req 9: Document Content Graph] --> R10
    R10[Req 10: Graph-Guided Hybrid Retrieval] --> R11
    R10 --> R12
    R11[Req 11: Community Detection]
    R12[Req 12: Text2Cypher]

    R13[Req 13: Postgres Independence<br/>Cross-Cutting] -.-> R5
    R13 -.-> R8
    R14[Req 14: Graceful Degradation<br/>Cross-Cutting] -.-> R5
    R14 -.-> R7
    R14 -.-> R8

    style R1 fill:#c8e6c9
    style R2 fill:#c8e6c9
    style R3 fill:#c8e6c9
    style R4 fill:#c8e6c9
    style R13 fill:#ffecb3
    style R14 fill:#ffecb3

    classDef foundation fill:#c8e6c9,stroke:#388e3c
    classDef crosscut fill:#ffecb3,stroke:#f57c00

Critical path: R1 → R5 → R6 → R7 → R10 → R11/R12

Parallelizable foundations (no dependencies): R1, R2, R3, R4 can all be built simultaneously.
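
The build order implied by the dependency graph can be derived mechanically. The sketch below is illustrative only: it copies the solid edges from the diagram (the dotted cross-cutting edges from R13 and R14 are left out), and `buildWaves` is a hypothetical helper name, not part of the spec.

```typescript
// Hard dependency edges copied from the diagram above (prerequisite -> dependent).
const edges: [string, string][] = [
  ["R1", "R5"], ["R2", "R6"], ["R3", "R5"], ["R4", "R5"],
  ["R5", "R6"], ["R5", "R7"], ["R5", "R9"],
  ["R6", "R7"], ["R6", "R10"],
  ["R7", "R8"], ["R7", "R10"],
  ["R9", "R10"],
  ["R10", "R11"], ["R10", "R12"],
];

/** Group requirements into waves; each wave depends only on earlier waves. */
function buildWaves(deps: [string, string][]): string[][] {
  const nodes = new Set<string>();
  const prereqs = new Map<string, Set<string>>();
  for (const [from, to] of deps) {
    nodes.add(from);
    nodes.add(to);
    if (!prereqs.has(to)) prereqs.set(to, new Set<string>());
    prereqs.get(to)!.add(from);
  }
  const done = new Set<string>();
  const waves: string[][] = [];
  while (done.size < nodes.size) {
    // A requirement is ready once all of its prerequisites are done.
    const ready = Array.from(nodes)
      .filter((n) => !done.has(n))
      .filter((n) =>
        Array.from(prereqs.get(n) ?? new Set<string>()).every((p) => done.has(p)),
      )
      .sort((a, b) => Number(a.slice(1)) - Number(b.slice(1)));
    if (ready.length === 0) throw new Error("dependency cycle detected");
    ready.forEach((n) => done.add(n));
    waves.push(ready);
  }
  return waves;
}

const waves = buildWaves(edges);
```

With these edges the first wave is R1 through R4, R5 follows alone, and R11 and R12 land in the final wave, which matches the critical path stated above.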


Components and Interfaces

1. Sidecar API Contracts

POST /extract-entities (Enhanced)

Existing endpoint enhanced with optional CLS embeddings. Backward compatible.

// Request (unchanged)
interface ExtractEntitiesRequest {
  chunks: string[];
}

// Response when called WITHOUT ?include_embeddings=true (unchanged)
interface ExtractEntitiesResponse {
  results: { text: string; entities: { text: string; label: string; score: number }[] }[];
  total_entities: number;
}

// Response when called WITH ?include_embeddings=true (enhanced)
interface ExtractEntitiesEnhancedResponse {
  results: {
    text: string;
    entities: {
      text: string;
      label: string;
      score: number;
      embedding: number[]; // 768-dim BERT CLS vector
    }[];
  }[];
  total_entities: number;
}
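
Because the same endpoint returns two shapes, a caller has to know which one it received. The following is a minimal sketch of one way to handle that, assuming the caller builds the URL from SIDECAR_URL; `extractEntitiesUrl` and `hasClsEmbeddings` are hypothetical helper names, and the local types are simplified copies of the interfaces above.

```typescript
interface EntityLike {
  text: string;
  label: string;
  score: number;
  embedding?: number[];
}

interface ExtractEntitiesResponseLike {
  results: { text: string; entities: EntityLike[] }[];
  total_entities: number;
}

const CLS_DIMENSIONS = 768;

/** Build the endpoint URL; `sidecarUrl` would come from SIDECAR_URL. */
function extractEntitiesUrl(sidecarUrl: string, includeEmbeddings: boolean): string {
  const base = sidecarUrl.replace(/\/+$/, ""); // drop trailing slashes
  return includeEmbeddings
    ? `${base}/extract-entities?include_embeddings=true`
    : `${base}/extract-entities`;
}

/** True only when every entity carries a 768-dim embedding. */
function hasClsEmbeddings(res: ExtractEntitiesResponseLike): boolean {
  return res.results.every((r) =>
    r.entities.every(
      (e) => Array.isArray(e.embedding) && e.embedding.length === CLS_DIMENSIONS,
    ),
  );
}
```

Note that a response with no entities passes the check vacuously, so callers that need at least one embedding should also inspect `total_entities`.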

POST /extract-relationships (New)

Model-agnostic relationship extraction via any OpenAI-compatible LLM.

interface ExtractRelationshipsRequest {
  chunks: string[];
  known_entities?: string[]; // Optional: constrain relationship targets
}

interface ExtractionEntity {
  name: string;
  type: "PERSON" | "ORGANIZATION" | "LOCATION" | "PRODUCT" | "EVENT" | "OTHER";
}

interface ExtractionRelationship {
  source: string;   // Must match an entity name
  target: string;   // Must match an entity name
  type: string;     // SCREAMING_SNAKE_CASE: ^[A-Z][A-Z0-9_]*$
  detail: string;   // Brief evidence description
}

interface ExtractionChunkResult {
  text: string;
  entities: ExtractionEntity[];
  relationships: ExtractionRelationship[];
  dropped_relationships: ExtractionRelationship[]; // Invalid source/target
}

interface ExtractRelationshipsResponse {
  results: ExtractionChunkResult[];
  total_entities: number;
  total_relationships: number;
  total_dropped: number;
}
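
The split between `relationships` and `dropped_relationships` can be pictured as a simple partition over the LLM output. This is a sketch under stated assumptions, not the sidecar's actual code: it assumes exact, case-sensitive name matching and that each candidate's `type` has already passed the SCREAMING_SNAKE_CASE check. `partitionRelationships` is a hypothetical name.

```typescript
interface Entity {
  name: string;
  type: string;
}

interface Relationship {
  source: string;
  target: string;
  type: string;
  detail: string;
}

interface PartitionResult {
  relationships: Relationship[];
  dropped_relationships: Relationship[];
}

/** Keep relationships whose endpoints both name an entity in the same chunk. */
function partitionRelationships(
  entities: Entity[],
  candidates: Relationship[],
): PartitionResult {
  const names = new Set(entities.map((e) => e.name));
  const relationships: Relationship[] = [];
  const dropped: Relationship[] = [];
  for (const rel of candidates) {
    const valid = names.has(rel.source) && names.has(rel.target);
    (valid ? relationships : dropped).push(rel);
  }
  return { relationships, dropped_relationships: dropped };
}
```

Returning the dropped items instead of discarding them keeps `total_dropped` auditable and makes prompt regressions visible.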

2. Embedding Provider Interface

/** Model-agnostic embedding provider — all sub-features use this interface */
interface EmbeddingProvider {
  /** Embed a single text string */
  embed(text: string): Promise<number[]>;
  /** Embed multiple texts in a batch */
  embedBatch(texts: string[]): Promise<number[][]>;
  /** The dimensionality of output vectors */
  readonly dimensions: number;
  /** The provider name for logging */
  readonly providerName: string;
}

/** Factory: reads env vars, returns the appropriate provider */
function createEmbeddingProvider(): EmbeddingProvider;
// - No EMBEDDING_PROVIDER or EMBEDDING_PROVIDER=bert → BertEmbeddingProvider (768-dim, calls Sidecar /embed)
// - EMBEDDING_PROVIDER=openai-compatible → OpenAICompatibleEmbeddingProvider (configurable dim)
// - Fallback: if external API unreachable → BertEmbeddingProvider with warning log
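
The selection rules in the comments above could be resolved before any provider is constructed. The sketch below covers only that configuration step. `chooseEmbeddingProvider` and `ProviderChoice` are hypothetical names, and treating an incomplete external configuration like an unreachable API is an assumption, not something the spec states.

```typescript
type EmbeddingEnv = {
  EMBEDDING_PROVIDER?: string;
  EMBEDDING_API_URL?: string;
  EMBEDDING_MODEL?: string;
  EMBEDDING_DIMENSIONS?: string;
};

type ProviderChoice =
  | { kind: "bert"; dimensions: number }
  | { kind: "openai-compatible"; apiUrl: string; model: string; dimensions: number };

function chooseEmbeddingProvider(env: EmbeddingEnv): ProviderChoice {
  const bert: ProviderChoice = { kind: "bert", dimensions: 768 };
  // Unset or "bert" selects the free local default.
  if (!env.EMBEDDING_PROVIDER || env.EMBEDDING_PROVIDER === "bert") return bert;
  if (env.EMBEDDING_PROVIDER !== "openai-compatible") {
    throw new Error("Unsupported EMBEDDING_PROVIDER: " + env.EMBEDDING_PROVIDER);
  }
  const dimensions = Number(env.EMBEDDING_DIMENSIONS);
  const complete =
    Boolean(env.EMBEDDING_API_URL) &&
    Boolean(env.EMBEDDING_MODEL) &&
    Number.isInteger(dimensions) &&
    dimensions > 0;
  if (!complete) {
    // Assumption: incomplete external config falls back to BERT with a warning.
    console.warn("Incomplete external embedding config, falling back to BERT");
    return bert;
  }
  return {
    kind: "openai-compatible",
    apiUrl: env.EMBEDDING_API_URL as string,
    model: env.EMBEDDING_MODEL as string,
    dimensions,
  };
}
```

The runtime fallback for an unreachable API would sit in the provider itself and is not shown here.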

3. Neo4j Direct Writer Interface

/** Replaces the old neo4j-sync.ts "read from Postgres" pattern */
interface Neo4jDirectWriter {
  /** Write entities directly to Neo4j (idempotent MERGE) */
  writeEntities(entities: Neo4jEntityInput[], companyId: string): Promise<number>;
  /** Write relationships with dynamic Cypher types (idempotent MERGE) */
  writeRelationships(relationships: Neo4jRelationshipInput[], companyId: string): Promise<string[]>;
  /** Write section nodes and MENTIONED_IN edges */
  writeMentions(mentions: Neo4jMentionInput[], companyId: string): Promise<number>;
  /** Write document content graph nodes (Document, Topic, cross-doc links) */
  writeDocumentGraph(doc: Neo4jDocumentGraphInput, companyId: string): Promise<void>;
  /** Ensure vector indexes exist (idempotent) */
  ensureIndexes(): Promise<void>;
}

interface Neo4jEntityInput {
  name: string;           // normalized lowercase
  displayName: string;    // original casing
  label: string;          // PER, ORG, LOC, PRODUCT, EVENT, MISC, OTHER
  confidence: number;
  mentionCount: number;
  companyId: string;
  embedding?: number[];   // 768-dim BERT CLS vector (nullable)
}

interface Neo4jRelationshipInput {
  sourceName: string;
  sourceLabel: string;
  targetName: string;
  targetLabel: string;
  relationType: string;   // SCREAMING_SNAKE_CASE dynamic type
  weight: number;
  evidenceCount: number;
  detail?: string;
  documentId: number;
  companyId: string;
}

interface Neo4jMentionInput {
  entityName: string;
  entityLabel: string;
  sectionId: number;
  documentId: number;
  confidence: number;
  companyId: string;
}

interface Neo4jWriteResult {
  entities: number;
  mentions: number;
  relationships: number;
  dynamicRelTypes: string[];
  durationMs: number;
}
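
Dynamic relationship types need care, because a relationship type generally cannot be supplied as an ordinary Cypher query parameter and ends up interpolated into the query text. The sketch below shows one way `writeRelationships` might build its statement. It is an assumption about the implementation, not the spec's code: `buildRelationshipMerge` is a hypothetical helper, and it assumes the MERGE key is the endpoint pair plus the type.

```typescript
interface RelationshipInput {
  sourceName: string;
  sourceLabel: string;
  targetName: string;
  targetLabel: string;
  relationType: string;
  weight: number;
  evidenceCount: number;
  detail?: string;
  documentId: number;
  companyId: string;
}

const RELATION_TYPE_PATTERN = /^[A-Z][A-Z0-9_]*$/;

function buildRelationshipMerge(
  rel: RelationshipInput,
): { cypher: string; params: Record<string, unknown> } {
  // Validate before interpolating: the type becomes part of the query text.
  if (!RELATION_TYPE_PATTERN.test(rel.relationType)) {
    throw new Error("Invalid relationship type: " + rel.relationType);
  }
  const cypher = [
    "MATCH (s:Entity {name: $sourceName, label: $sourceLabel, companyId: $companyId})",
    "MATCH (t:Entity {name: $targetName, label: $targetLabel, companyId: $companyId})",
    "MERGE (s)-[r:`" + rel.relationType + "`]->(t)",
    // SET overwrites rather than accumulates, so a repeated write is a no-op.
    "SET r.weight = $weight, r.evidenceCount = $evidenceCount,",
    "    r.detail = $detail, r.documentId = $documentId",
  ].join("\n");
  const params: Record<string, unknown> = {
    sourceName: rel.sourceName,
    sourceLabel: rel.sourceLabel,
    targetName: rel.targetName,
    targetLabel: rel.targetLabel,
    companyId: rel.companyId,
    weight: rel.weight,
    evidenceCount: rel.evidenceCount,
    detail: rel.detail ?? null,
    documentId: rel.documentId,
  };
  return { cypher, params };
}
```

All values other than the type travel as parameters, so only the regex-checked type ever touches the query string.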

Data Models

Neo4j Graph Schema

-- ═══════════════════════════════════════════════════════════════
-- NODE TYPES
-- ═══════════════════════════════════════════════════════════════

-- Entity node (core knowledge graph node)
(:Entity {
  name: String,            -- normalized lowercase (MERGE key)
  displayName: String,     -- original casing
  label: String,           -- PER, ORG, LOC, PRODUCT, EVENT, MISC, OTHER (MERGE key)
  confidence: Float,       -- average extraction confidence
  mentionCount: Integer,   -- total mentions across all documents
  companyId: String,       -- company scope (MERGE key)
  embedding: List<Float>,  -- 768-dim BERT CLS vector (nullable)
  communityId: Integer     -- Leiden community assignment (nullable, Req 11)
})

-- Section node (lightweight; content stays in Postgres)
(:Section {
  id: Integer,             -- matches documentSections.id in Postgres
  documentId: Integer      -- matches document.id in Postgres
})

-- Document node (Req 9: content graph)
(:Document {
  id: Integer,             -- matches document.id in Postgres
  name: String,            -- document name
  companyId: String,       -- company scope
  uploadedAt: String       -- ISO timestamp
})

-- Topic node (Req 9: content graph)
(:Topic {
  name: String,            -- topic name
  companyId: String,       -- company scope
  embedding: List<Float>   -- 768-dim embedding for similarity
})

-- Community node (Req 11: community detection)
(:Community {
  id: Integer,             -- community identifier
  summary: String,         -- LLM-generated 2-3 sentence summary
  companyId: String,       -- company scope
  embedding: List<Float>   -- embedding of the summary text
})

-- ═══════════════════════════════════════════════════════════════
-- RELATIONSHIP TYPES
-- ═══════════════════════════════════════════════════════════════

-- Entity → Section (mentions)
(:Entity)-[:MENTIONED_IN {confidence: Float}]->(:Section)

-- Entity → Entity (dynamic types, NOT generic RELATES_TO)
-- Examples: CEO_OF, ACQUIRED, HEADQUARTERED_IN, COMPETES_WITH, CO_OCCURS
(:Entity)-[:<DYNAMIC_TYPE> {
  weight: Float,
  evidenceCount: Integer,
  detail: String,          -- evidence text (nullable)
  documentId: Integer      -- source document
}]->(:Entity)

-- Entity resolution (Req 6)
(:Entity)-[:ALIAS_OF]->(:Entity)  -- alias → canonical

-- Document content graph (Req 9)
(:Document)-[:CONTAINS]->(:Section)
(:Section)-[:DISCUSSES]->(:Topic)
(:Document)-[:REFERENCES {sharedEntityCount: Integer}]->(:Document)
(:Topic)-[:RELATED_TO]->(:Topic)

-- Community membership (Req 11)
(:Entity)-[:BELONGS_TO]->(:Community)

-- ═══════════════════════════════════════════════════════════════
-- VECTOR INDEXES
-- ═══════════════════════════════════════════════════════════════

-- Entity embeddings (768-dim, cosine similarity)
CREATE VECTOR INDEX `entity-embeddings` IF NOT EXISTS
FOR (e:Entity) ON (e.embedding)
OPTIONS {indexConfig: {`vector.dimensions`: 768, `vector.similarity_function`: 'cosine'}};

-- Topic embeddings (768-dim, cosine similarity) — Req 9
CREATE VECTOR INDEX `topic-embeddings` IF NOT EXISTS
FOR (t:Topic) ON (t.embedding)
OPTIONS {indexConfig: {`vector.dimensions`: 768, `vector.similarity_function`: 'cosine'}};

-- Community summary embeddings (Req 11)
CREATE VECTOR INDEX `community-embeddings` IF NOT EXISTS
FOR (c:Community) ON (c.embedding)
OPTIONS {indexConfig: {`vector.dimensions`: 768, `vector.similarity_function`: 'cosine'}};

Shared Zod Schemas (Parent-Level Contracts)

These are the runtime-validated contracts that ALL sub-feature specs must conform to. They update the old gemma-schemas.ts with model-agnostic naming and new node types.

import { z } from "zod";

// ═══════════════════════════════════════════════════════════════
// SIDECAR API SCHEMAS
// ═══════════════════════════════════════════════════════════════

// ── BERT Entity (base, without embedding) ────────────────────

export const EntityBaseSchema = z.object({
  text: z.string().min(1),
  label: z.string().min(1),
  score: z.number().min(0).max(1),
});

// ── BERT Entity (with CLS embedding) ────────────────────────

export const EntityWithEmbeddingSchema = EntityBaseSchema.extend({
  embedding: z.array(z.number()).length(768),
});

// ── Extract Entities Response (base, backward compatible) ────

export const ExtractEntitiesResponseSchema = z.object({
  results: z.array(z.object({
    text: z.string(),
    entities: z.array(EntityBaseSchema),
  })),
  total_entities: z.number().int().nonnegative(),
});

// ── Extract Entities Enhanced Response (with embeddings) ─────

export const ExtractEntitiesEnhancedResponseSchema = z.object({
  results: z.array(z.object({
    text: z.string(),
    entities: z.array(EntityWithEmbeddingSchema),
  })),
  total_entities: z.number().int().nonnegative(),
});

// ── Extraction Entity (LLM-extracted, model-agnostic) ────────

export const ExtractionEntitySchema = z.object({
  name: z.string().min(1),
  type: z.enum(["PERSON", "ORGANIZATION", "LOCATION", "PRODUCT", "EVENT", "OTHER"]),
});

// ── Extraction Relationship ──────────────────────────────────

export const ExtractionRelationshipSchema = z.object({
  source: z.string().min(1),
  target: z.string().min(1),
  type: z.string().min(1).regex(/^[A-Z][A-Z0-9_]*$/), // SCREAMING_SNAKE_CASE
  detail: z.string(),
});

// ── Extraction Chunk Result ──────────────────────────────────

export const ExtractionChunkResultSchema = z.object({
  text: z.string(),
  entities: z.array(ExtractionEntitySchema),
  relationships: z.array(ExtractionRelationshipSchema),
  dropped_relationships: z.array(ExtractionRelationshipSchema),
});

// ── Extract Relationships Response ───────────────────────────

export const ExtractRelationshipsResponseSchema = z.object({
  results: z.array(ExtractionChunkResultSchema),
  total_entities: z.number().int().nonnegative(),
  total_relationships: z.number().int().nonnegative(),
  total_dropped: z.number().int().nonnegative(),
});

// ═══════════════════════════════════════════════════════════════
// NEO4J DATA SHAPE SCHEMAS
// ═══════════════════════════════════════════════════════════════

// ── Neo4j Entity Node ────────────────────────────────────────

export const Neo4jEntityNodeSchema = z.object({
  name: z.string().min(1),
  displayName: z.string().min(1),
  label: z.string().min(1),
  confidence: z.number().min(0).max(1),
  mentionCount: z.number().int().positive(),
  companyId: z.string().min(1),
  embedding: z.array(z.number()).length(768).nullable(),
});

// ── Neo4j Section Node ───────────────────────────────────────

export const Neo4jSectionNodeSchema = z.object({
  id: z.number().int().positive(),
  documentId: z.number().int().positive(),
});

// ── Neo4j Document Node (Req 9) ──────────────────────────────

export const Neo4jDocumentNodeSchema = z.object({
  id: z.number().int().positive(),
  name: z.string().min(1),
  companyId: z.string().min(1),
  uploadedAt: z.string().min(1),
});

// ── Neo4j Topic Node (Req 9) ─────────────────────────────────

export const Neo4jTopicNodeSchema = z.object({
  name: z.string().min(1),
  companyId: z.string().min(1),
  embedding: z.array(z.number()).length(768),
});

// ── Neo4j Community Node (Req 11) ────────────────────────────

export const Neo4jCommunityNodeSchema = z.object({
  id: z.number().int().nonnegative(),
  summary: z.string().min(1),
  companyId: z.string().min(1),
  embedding: z.array(z.number()).length(768),
});

// ── Neo4j Relationship Properties ────────────────────────────

export const Neo4jDynamicRelPropertiesSchema = z.object({
  weight: z.number().min(0).max(1),
  evidenceCount: z.number().int().positive(),
  detail: z.string().nullable(),
  documentId: z.number().int().positive(),
});

// ── Neo4j Write Result ───────────────────────────────────────

export const Neo4jWriteResultSchema = z.object({
  entities: z.number().int().nonnegative(),
  mentions: z.number().int().nonnegative(),
  relationships: z.number().int().nonnegative(),
  dynamicRelTypes: z.array(z.string()),
  durationMs: z.number().nonnegative(),
});

// ═══════════════════════════════════════════════════════════════
// EMBEDDING PROVIDER SCHEMA
// ═══════════════════════════════════════════════════════════════

export const EmbeddingResultSchema = z.object({
  embedding: z.array(z.number()).min(1),
  dimensions: z.number().int().positive(),
  providerName: z.string().min(1),
});

export const EmbeddingBatchResultSchema = z.object({
  embeddings: z.array(z.array(z.number()).min(1)),
  dimensions: z.number().int().positive(),
  providerName: z.string().min(1),
});

// ═══════════════════════════════════════════════════════════════
// ENSEMBLE SEARCH SCHEMAS
// ═══════════════════════════════════════════════════════════════

export const GraphRetrieverResultSchema = z.object({
  sectionIds: z.array(z.number().int().positive()),
  entityMatchCount: z.number().int().nonnegative(),
  traversalHops: z.number().int().nonnegative(),
  durationMs: z.number().nonnegative(),
});

// ═══════════════════════════════════════════════════════════════
// TYPE EXPORTS
// ═══════════════════════════════════════════════════════════════

export type EntityBase = z.infer<typeof EntityBaseSchema>;
export type EntityWithEmbedding = z.infer<typeof EntityWithEmbeddingSchema>;
export type ExtractionEntity = z.infer<typeof ExtractionEntitySchema>;
export type ExtractionRelationship = z.infer<typeof ExtractionRelationshipSchema>;
export type ExtractionChunkResult = z.infer<typeof ExtractionChunkResultSchema>;
export type ExtractRelationshipsResponse = z.infer<typeof ExtractRelationshipsResponseSchema>;
export type Neo4jEntityNode = z.infer<typeof Neo4jEntityNodeSchema>;
export type Neo4jDocumentNode = z.infer<typeof Neo4jDocumentNodeSchema>;
export type Neo4jTopicNode = z.infer<typeof Neo4jTopicNodeSchema>;
export type Neo4jCommunityNode = z.infer<typeof Neo4jCommunityNodeSchema>;
export type Neo4jWriteResult = z.infer<typeof Neo4jWriteResultSchema>;
export type EmbeddingResult = z.infer<typeof EmbeddingResultSchema>;

Environment Variable Contract

All Neo4j-related environment variables across all requirements, with validation rules:

// ═══════════════════════════════════════════════════════════════
// ENVIRONMENT VARIABLE CONTRACT
// ═══════════════════════════════════════════════════════════════
//
// All variables are OPTIONAL — the system works without any of them.
// Each variable gates a specific capability.
//
// ┌─────────────────────────────────┬──────────┬─────────────────────────────────────────────────┐
// │ Variable                        │ Required │ Gates                                           │
// ├─────────────────────────────────┼──────────┼─────────────────────────────────────────────────┤
// │ NEO4J_URI                       │ No       │ All Neo4j operations (Req 1,5,6,7,8,9)          │
// │ NEO4J_USERNAME                  │ No       │ Neo4j auth (default: "neo4j")                   │
// │ NEO4J_PASSWORD                  │ No       │ Neo4j auth (default: "password")                │
// │ EXTRACTION_LLM_BASE_URL         │ No       │ Relationship extraction (Req 4,5)               │
// │ EXTRACTION_LLM_MODEL            │ No       │ LLM model selection (Req 4)                     │
// │ EMBEDDING_PROVIDER              │ No       │ Embedding backend: "bert" | "openai-compatible" │
// │ EMBEDDING_API_URL               │ No       │ External embedding API (Req 2)                  │
// │ EMBEDDING_MODEL                 │ No       │ External embedding model name (Req 2)           │
// │ EMBEDDING_DIMENSIONS            │ No       │ External embedding dimensions (Req 2)           │
// │ ENABLE_GRAPH_RETRIEVER          │ No       │ Graph retriever in ensemble (Req 7,8)           │
// │ ENTITY_RESOLUTION_THRESHOLD     │ No       │ Cosine similarity threshold (default: 0.85)     │
// │ SIDECAR_URL                     │ No       │ All sidecar operations (Req 3,4)                │
// └─────────────────────────────────┴──────────┴─────────────────────────────────────────────────┘
//
// GATING LOGIC:
//   - NEO4J_URI unset        → skip ENTIRE graph pipeline (Steps F, F2, G, G2, G3, G4)
//                               No entity extraction, no relationship extraction,
//                               no graph data stored anywhere. Pipeline ends at Step E.
//   - NEO4J_URI set + EXTRACTION_LLM_BASE_URL unset → skip relationship extraction only,
//                               graph has BERT entities with CO_OCCURS relationships
//   - NEO4J_URI set + SIDECAR_URL unset → skip entity extraction, no graph data
//   - ENABLE_GRAPH_RETRIEVER=false → ensemble uses only BM25+Vector (2-retriever)
//   - EMBEDDING_PROVIDER unset → default to BERT via Sidecar (768-dim, free)
//
// MIGRATION FROM OLD SPEC:
//   - GEMMA_BASE_URL → EXTRACTION_LLM_BASE_URL (model-agnostic naming)
//   - GEMMA_MODEL    → EXTRACTION_LLM_MODEL    (model-agnostic naming)
//   - Old vars remain in src/env.ts for backward compatibility during transition

// Zod validation schema additions for src/env.ts:
const neo4jGraphRAGEnvSchema = z.object({
  // Neo4j connection (optional — enables graph storage)
  NEO4J_URI: optionalString(),
  NEO4J_USERNAME: optionalString(),
  NEO4J_PASSWORD: optionalString(),

  // Extraction LLM (optional — enables relationship extraction)
  // Model-agnostic: works with Ollama, LM Studio, OpenAI, Azure, any OpenAI-compatible
  EXTRACTION_LLM_BASE_URL: optionalString(),
  EXTRACTION_LLM_MODEL: optionalString(),

  // Embedding provider (optional — defaults to BERT via Sidecar)
  EMBEDDING_PROVIDER: z.enum(["bert", "openai-compatible"]).optional(),
  EMBEDDING_API_URL: optionalString(),
  EMBEDDING_MODEL: optionalString(),
  EMBEDDING_DIMENSIONS: z.coerce.number().int().positive().optional(),

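  // Graph retriever toggle
  // An unset value stays undefined; otherwise the preprocess step would turn
  // it into false and .optional() would never apply.
  ENABLE_GRAPH_RETRIEVER: z.preprocess(
    (val) => (val === undefined ? undefined : val === "true" || val === "1"),
    z.boolean().optional()
  ),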

  // Entity resolution
  ENTITY_RESOLUTION_THRESHOLD: z.coerce.number().min(0).max(1).optional(),

  // Backward compatibility (old spec names — deprecated)
  GEMMA_BASE_URL: optionalString(),
  GEMMA_MODEL: optionalString(),
});
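
The gating rules listed above reduce to a small pure function over the environment. The sketch below is one possible reading, not the spec's implementation. `planGraphFeatures` is a hypothetical name, runtime health checks are out of scope, and it assumes the graph retriever defaults to on when Neo4j is configured unless ENABLE_GRAPH_RETRIEVER is explicitly "false" or "0".

```typescript
interface GraphEnv {
  NEO4J_URI?: string;
  SIDECAR_URL?: string;
  EXTRACTION_LLM_BASE_URL?: string;
  ENABLE_GRAPH_RETRIEVER?: string;
}

interface GraphPlan {
  entityExtraction: boolean;       // Step F
  relationshipExtraction: boolean; // Step F2
  neo4jWrite: boolean;             // Step G
  graphRetriever: boolean;         // third retriever in the ensemble
}

function planGraphFeatures(env: GraphEnv): GraphPlan {
  const neo4j = Boolean(env.NEO4J_URI);
  const sidecar = Boolean(env.SIDECAR_URL);
  // Both extraction steps call sidecar endpoints, so both need SIDECAR_URL.
  const entityExtraction = neo4j && sidecar;
  const relationshipExtraction =
    neo4j && sidecar && Boolean(env.EXTRACTION_LLM_BASE_URL);
  // Assumption: only an explicit "false" or "0" turns the retriever off.
  const retrieverOff =
    env.ENABLE_GRAPH_RETRIEVER === "false" || env.ENABLE_GRAPH_RETRIEVER === "0";
  return {
    entityExtraction,
    relationshipExtraction,
    neo4jWrite: entityExtraction || relationshipExtraction,
    graphRetriever: neo4j && !retrieverOff,
  };
}
```

Keeping this logic in one function gives ingestion and search a single place to agree on what "Neo4j enabled" means.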

Ingestion Pipeline Contract

Step ordering with inputs/outputs and gating logic:

═══════════════════════════════════════════════════════════════
EXISTING STEPS (A-E) — UNCHANGED, NEVER MODIFIED
═══════════════════════════════════════════════════════════════

Step A: Upload
  Input:  File (PDF, DOCX, etc.)
  Output: Raw file stored, ocrJob created
  Gate:   Always runs

Step B: OCR / Normalize
  Input:  Raw file
  Output: PageContent[] (text per page)
  Gate:   Always runs

Step C: Chunk
  Input:  PageContent[]
  Output: DocumentChunk[] (parent + child chunks)
  Gate:   Always runs

Step D: Embed
  Input:  DocumentChunk[]
  Output: VectorizedChunk[] (with embeddings; 1536-dim when OpenAI is used)
  Gate:   Always runs (uses OpenAI or Sidecar)

Step E: Store Sections
  Input:  VectorizedChunk[]
  Output: StoredSection[] {sectionId, content} in Postgres
  Gate:   Always runs

═══════════════════════════════════════════════════════════════
NEO4J GRAPH PIPELINE — ALL GATED BY NEO4J_URI
═══════════════════════════════════════════════════════════════

When NEO4J_URI is NOT set, NONE of these steps run. No entity
extraction, no relationship extraction, no graph data stored
anywhere. The pipeline ends at Step E.

Step F: BERT NER Entity Extraction
  Input:  StoredSection[] (from Step E)
  Output: Entities + CO_OCCURS relationships + CLS embeddings
  Gate:   NEO4J_URI is set AND SIDECAR_URL is set AND sidecar is healthy
  Target: Passed directly to Step G (Neo4j) — NOT stored in Postgres

Step F2: LLM Relationship Extraction (NEW)
  Input:  StoredSection[] (same chunks as Step F)
  Output: Typed relationships (CEO_OF, ACQUIRED, etc.)
          + additional entities from LLM
  Gate:   NEO4J_URI is set AND EXTRACTION_LLM_BASE_URL is set AND sidecar is healthy
  Target: Passed directly to Step G (Neo4j) — NOT stored in Postgres
  Note:   Independent of Step F — if BERT fails, LLM still runs

Step G: Neo4j Direct Write
  Input:  All entities + relationships from Steps F and F2
          (passed directly in memory, NOT read from Postgres)
  Output: Neo4jWriteResult
  Gate:   NEO4J_URI is set AND Neo4j is healthy
  Target: Neo4j graph store (direct MERGE queries)
  Note:   Neo4j is the ONLY place graph data lives.
          Postgres kg_* tables are NOT used by this pipeline.

Step G2: Document Content Graph (NEW, Req 9)
  Input:  Document metadata + Section IDs + extracted topics
  Output: Document, Topic nodes + CONTAINS, DISCUSSES, REFERENCES edges
  Gate:   NEO4J_URI is set AND Neo4j is healthy
  Target: Neo4j graph store

Step G3: Entity Resolution (NEW, Req 6)
  Input:  Newly written entities with embeddings
  Output: Merged entities + ALIAS_OF relationships
  Gate:   NEO4J_URI is set AND entity-embeddings vector index exists
  Target: Neo4j graph store

Step G4: Community Detection (NEW, Req 11)
  Input:  Entity graph in Neo4j
  Output: Community assignments + Community summary nodes
  Gate:   NEO4J_URI is set AND GDS plugin available
  Target: Neo4j graph store

═══════════════════════════════════════════════════════════════
FAILURE ISOLATION
═══════════════════════════════════════════════════════════════

- Step F failure  → Step F2 still attempts to run, Step G writes whatever data is available
- Step F2 failure → Step G still writes BERT-only data from Step F
- Step G failure  → Graph data is lost for this document (nowhere else to store it)
                    Document is still marked as successfully ingested (Postgres Steps A-E are source of truth)
- Step G2 failure → Entity graph is still valid, just no content graph
- Step G3 failure → Entities exist but may have duplicates
- Step G4 failure → Graph works, just no community summaries
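
The first two isolation rules amount to "write whatever succeeded". A minimal sketch of that merge follows, using simplified string payloads in place of the real entity and relationship shapes. `StepOutcome` and `collectForWrite` are hypothetical names introduced for illustration.

```typescript
type StepOutcome<T> =
  | { status: "ok"; value: T }
  | { status: "failed"; error: string };

interface Extraction {
  entities: string[];
  relationships: string[];
}

/** Merge the outcomes of Steps F and F2; null means Step G has nothing to write. */
function collectForWrite(...outcomes: StepOutcome<Extraction>[]): Extraction | null {
  const entities = new Set<string>();
  const relationships: string[] = [];
  let anySucceeded = false;
  for (const outcome of outcomes) {
    if (outcome.status !== "ok") continue; // a failed step contributes nothing
    anySucceeded = true;
    outcome.value.entities.forEach((e) => entities.add(e));
    relationships.push(...outcome.value.relationships);
  }
  return anySucceeded
    ? { entities: Array.from(entities), relationships }
    : null;
}
```

Modelling a failure as a value rather than an exception lets Step G run unconditionally and decide from the merged result.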

Ensemble Search Contract

How the graph retriever plugs into the existing ensemble:

// ═══════════════════════════════════════════════════════════════
// ENSEMBLE SEARCH INTEGRATION
// ═══════════════════════════════════════════════════════════════

// Weight configurations
const WEIGHTS_2_RETRIEVER = [0.4, 0.6];           // BM25, Vector (current default)
const WEIGHTS_3_RETRIEVER = [0.3, 0.5, 0.2];      // BM25, Vector, Graph (static default)

// Dynamic weight adjustment (Req 8.4)
// When graph retriever returns results, adjust weights based on match quality:
//   entityMatchCount >= 3 → graph weight increases to 0.3-0.5
//   entityMatchCount 1-2  → graph weight stays at 0.1-0.2
//   entityMatchCount 0    → graph weight drops to 0, effectively 2-retriever mode

interface GraphRetrieverMetrics {
  entityMatchCount: number;
  avgConfidence: number;
  traversalHops: number;
}

function computeDynamicWeights(metrics: GraphRetrieverMetrics): [number, number, number] {
  if (metrics.entityMatchCount >= 3 && metrics.avgConfidence > 0.7) {
    return [0.15, 0.35, 0.50]; // Strong graph signal
  }
  if (metrics.entityMatchCount >= 1) {
    return [0.30, 0.50, 0.20]; // Moderate graph signal
  }
  return [0.40, 0.60, 0.00];   // No graph signal → 2-retriever fallback
}

// Graceful degradation interface
// The ensemble MUST handle these failure modes without user-visible errors:
//   1. Neo4j unreachable → use 2-retriever weights, log warning
//   2. Graph retriever timeout (>2s) → use 2-retriever weights, log warning
//   3. Graph retriever returns empty → use 2-retriever weights (normal behavior)
//   4. Sidecar unreachable → skip reranking, return RRF-fused results
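
To show how the weight vectors feed into fusion, here is a sketch of weighted Reciprocal Rank Fusion over ranked lists of section IDs. The spec does not define the fusion formula, so both the weighting scheme (each retriever's reciprocal-rank term scaled by its weight) and the constant k = 60 are assumptions. `weightedRrf` is a hypothetical name.

```typescript
// Assumption: 60 is a commonly used smoothing constant; the spec does not fix one.
const RRF_K = 60;

/**
 * rankings[i] is retriever i's section IDs, best first.
 * weights[i] is that retriever's ensemble weight.
 */
function weightedRrf(rankings: number[][], weights: number[], k: number = RRF_K): number[] {
  if (rankings.length !== weights.length) {
    throw new Error("one weight per retriever is required");
  }
  const scores = new Map<number, number>();
  rankings.forEach((ranking, i) => {
    ranking.forEach((sectionId, position) => {
      const contribution = weights[i] / (k + position + 1); // ranks are 1-based
      scores.set(sectionId, (scores.get(sectionId) ?? 0) + contribution);
    });
  });
  // Highest fused score first; ties broken by section ID for determinism.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1] || a[0] - b[0])
    .map((entry) => entry[0]);
}
```

An empty graph ranking with weight 0 contributes nothing, so the same function covers both the 2-retriever and 3-retriever modes.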

Correctness Properties

A property is a characteristic or behavior that should hold true across all valid executions of a system — essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.

Property 1: Embedding output normalization

For any valid text input and any configured embedding backend (BERT or OpenAI-compatible), the EmbeddingProvider.embed() method SHALL return an array of floats with length equal to provider.dimensions, and EmbeddingProvider.embedBatch() SHALL return arrays all of the same dimensionality.

Validates: Requirements 2.1, 2.4

Property 2: BERT CLS embedding dimensionality

For any text chunk that produces at least one entity, when /extract-entities?include_embeddings=true is called, every entity in the response SHALL have an embedding field that is an array of exactly 768 floating-point numbers.

Validates: Requirements 3.1, 3.3

Property 3: Relationship extraction structural validity

For any extraction response from POST /extract-relationships, every relationship in the relationships array SHALL have source and target values that each match the name of an entity in the same chunk's entities array, AND every relationship type SHALL match the regex ^[A-Z][A-Z0-9_]*$.

Validates: Requirements 4.2, 4.3

Property 4: Think-tag stripping preserves JSON content

For any string containing <think>...</think> tags wrapping arbitrary text, followed by valid JSON content, the think-tag stripping function SHALL produce a string that parses as valid JSON.

Validates: Requirement 4.7
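
A minimal sketch of such a stripping function, assuming the tags are well-formed and not nested (`stripThinkTags` is a hypothetical name; the sidecar's real implementation may differ):

```typescript
/** Remove every <think>...</think> block and surrounding whitespace. */
function stripThinkTags(raw: string): string {
  // [\s\S] matches across newlines; the lazy quantifier stops at the first close tag.
  return raw.replace(/<think>[\s\S]*?<\/think>/gi, "").trim();
}

const llmOutput = '<think>The text mentions {a CEO}.</think>\n{"entities": [], "relationships": []}';
const parsed = JSON.parse(stripThinkTags(llmOutput));
```

An unclosed `<think>` tag is left untouched by this regex, so a production version would need an explicit policy for truncated output.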

Property 5: Neo4j entity write completeness

For any entity written to Neo4j via the direct writer, the resulting Entity node SHALL contain all required properties (name, displayName, label, confidence, mentionCount, companyId), and when the input entity has a non-null embedding, the node SHALL have an embedding property of length 768. For any entity mention, there SHALL exist a MENTIONED_IN edge from the Entity node to the corresponding Section node.

Validates: Requirements 5.3, 5.5, 6.1

Property 6: Dynamic Cypher relationship types

For any relationship written to Neo4j, the Cypher relationship type SHALL be the actual relationship type string (e.g., CEO_OF, ACQUIRED, CO_OCCURS) and SHALL NOT be a generic RELATES_TO type with a type property.

Validates: Requirement 5.4
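
One way to emit the actual relationship type is to interpolate it into the Cypher text after validating it, while keeping all values as parameters. The sketch below assumes Entity nodes are matched by `name` and `companyId`; the function name, the statement shape, and that matching key are hypothetical. Validation before interpolation matters because an interpolated type is not protected by parameter binding.

```typescript
// Same pattern as Property 3; rejects anything that could break out of the
// relationship-type position in the query text.
const SAFE_REL_TYPE = /^[A-Z][A-Z0-9_]*$/;

interface CypherStatement {
  text: string;
  params: Record<string, unknown>;
}

// Builds a MERGE statement whose relationship type is the real type string
// (e.g. CEO_OF), not a generic RELATES_TO with a type property.
function buildRelationshipMerge(
  sourceName: string,
  targetName: string,
  relType: string,
  companyId: string,
): CypherStatement {
  if (!SAFE_REL_TYPE.test(relType)) {
    throw new Error(`Invalid relationship type: ${relType}`);
  }
  const text = [
    // ASSUMPTION: entities are identified by (name, companyId).
    "MATCH (s:Entity {name: $source, companyId: $companyId})",
    "MATCH (t:Entity {name: $target, companyId: $companyId})",
    `MERGE (s)-[r:${relType}]->(t)`,
    "RETURN type(r) AS relType",
  ].join("\n");
  return { text, params: { source: sourceName, target: targetName, companyId } };
}
```

Using MERGE rather than CREATE also keeps repeated writes of the same relationship from producing duplicate edges.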

Property 7: Neo4j write idempotence

For any set of entities and relationships, writing them to Neo4j twice via the direct writer SHALL produce the same graph state (same node count, same edge count, same property values) as writing them once.

Validates: Requirement 5.7

Property 8: Entity resolution merges duplicates

For any two Entity nodes within the same companyId whose embedding cosine similarity exceeds the configured threshold (default 0.85), the entity resolution module SHALL merge them into a single canonical Entity node and create an ALIAS_OF relationship from the alias to the canonical entity.

Validates: Requirements 6.3, 6.4
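
The merge decision in Property 8 reduces to a cosine similarity comparison. The following is a sketch only: the helper names are hypothetical, and in practice the candidate pairs would likely come from a Neo4j vector index rather than a pairwise loop.

```typescript
// Cosine similarity of two equal-length vectors; returns 0 for a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Embedding dimensions differ");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Two entities are merge candidates when similarity strictly exceeds the
// threshold (default 0.85, as stated in Property 8).
function shouldMerge(a: number[], b: number[], threshold = 0.85): boolean {
  return cosineSimilarity(a, b) > threshold;
}
```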

Property 9: Graph retriever traverses all relationship types

For any Neo4j graph containing entities connected by multiple relationship types (e.g., CEO_OF, ACQUIRED, CO_OCCURS), the graph retriever SHALL traverse all relationship types when finding connected entities — not just CO_OCCURS.

Validates: Requirement 7.1

Property 10: Dynamic ensemble weight adjustment

For any graph retriever result with entityMatchCount >= 3 and avgConfidence > 0.7, the ensemble SHALL assign a graph weight of at least 0.3. For any result with entityMatchCount == 0, the graph weight SHALL be 0 (effectively 2-retriever mode).

Validates: Requirement 8.4
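
A sketch of the weight rule follows. Property 10 fixes only the two boundary cases, so the intermediate value of 0.15 below is purely an assumption for illustration, as are the `GraphSignal` and `graphWeight` names.

```typescript
// Hypothetical summary of a graph retriever result.
interface GraphSignal {
  entityMatchCount: number;
  avgConfidence: number;
}

function graphWeight(signal: GraphSignal): number {
  // No entity matches: graph contributes nothing (2-retriever mode).
  if (signal.entityMatchCount === 0) return 0;
  // Strong signal: at least 0.3, per Property 10.
  if (signal.entityMatchCount >= 3 && signal.avgConfidence > 0.7) return 0.3;
  // ASSUMPTION: the spec does not define the weight for weak signals;
  // 0.15 is a placeholder.
  return 0.15;
}
```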

Property 11: Document content graph structure

For any document ingested with Neo4j enabled, there SHALL exist a Document node with the correct id, name, companyId, and uploadedAt properties, and for each section belonging to that document, there SHALL exist a CONTAINS relationship from the Document node to the Section node.

Validates: Requirements 9.1, 9.2

Property 12: Cross-document and topic linking

For any two documents within the same companyId that share 3 or more entities, there SHALL exist a REFERENCES relationship between their Document nodes with a sharedEntityCount property. For any two Topic nodes within the same companyId with embedding cosine similarity above 0.8, there SHALL exist a RELATED_TO relationship.

Validates: Requirements 9.4, 9.5
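
The document-linking rule can be illustrated with a set intersection. This sketch assumes each document's entities are available as a list of canonical names; the helper names are hypothetical, and a real implementation might compute the same count directly in Cypher.

```typescript
// Counts distinct entity names that appear in both documents.
function sharedEntityCount(entitiesA: string[], entitiesB: string[]): number {
  const setA = new Set(entitiesA);
  const distinctB = Array.from(new Set(entitiesB));
  return distinctB.filter((name) => setA.has(name)).length;
}

// A REFERENCES edge is expected when the documents share 3 or more entities
// (threshold from Property 12).
function shouldLinkDocuments(
  entitiesA: string[],
  entitiesB: string[],
  minShared = 3,
): boolean {
  return sharedEntityCount(entitiesA, entitiesB) >= minShared;
}
```

The value returned by `sharedEntityCount` is what would be stored as the `sharedEntityCount` property on the relationship.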

Property 13: Combined graph+vector scoring

For any retrieval result from the graph-guided hybrid retriever, the result score SHALL incorporate both graph proximity (inversely proportional to hop distance) and vector similarity (cosine similarity to query embedding).

Validates: Requirement 10.3
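
As an illustration of Property 13, one possible scoring function is a weighted sum of the two signals. Everything specific here is an assumption: the spec does not give a formula, the `graphShare` blend factor of 0.5 is arbitrary, and `1 / (1 + hops)` is used instead of a strict `1 / hops` so that a zero-hop match (the seed entity's own section) has a finite score.

```typescript
// Decreases as hop distance grows: 1 for 0 hops, 0.5 for 1 hop, and so on.
function graphProximity(hops: number): number {
  if (!Number.isInteger(hops) || hops < 0) {
    throw new Error("hops must be a non-negative integer");
  }
  return 1 / (1 + hops);
}

// Blends graph proximity with vector (cosine) similarity.
// ASSUMPTION: linear blend with a hypothetical graphShare factor.
function combinedScore(
  hops: number,
  vectorSimilarity: number,
  graphShare = 0.5,
): number {
  return graphShare * graphProximity(hops) + (1 - graphShare) * vectorSimilarity;
}
```

With the same vector similarity, a result one hop away scores higher than one three hops away, which is the ordering the property describes.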

Property 14: Community node data shape

For any detected community, there SHALL exist a Community node in Neo4j with id, summary (non-empty string), companyId, and embedding (768-dim float array) properties.

Validates: Requirement 11.3

Property 15: Text2Cypher read-only validation

For any Cypher query generated by the Text2Cypher module, the read-only validator SHALL reject queries containing CREATE, DELETE, SET, MERGE, or DETACH keywords (case-insensitive), preventing accidental graph mutations.

Validates: Requirement 12.4
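
A keyword-based validator for Property 15 might look like the sketch below. The names are hypothetical, and the keyword list is exactly the one given in the property. Word boundaries keep identifiers such as `OFFSET` or `createdAt` from matching, but the check is deliberately conservative: a keyword inside a string literal will also cause rejection.

```typescript
// Keywords listed in Property 15.
const WRITE_KEYWORDS = ["CREATE", "DELETE", "SET", "MERGE", "DETACH"];

// Case-insensitive, whole-word match. No "g" flag, so test() is stateless.
const WRITE_PATTERN = new RegExp(`\\b(${WRITE_KEYWORDS.join("|")})\\b`, "i");

// Returns true only when none of the write keywords appear in the query.
function isReadOnlyCypher(query: string): boolean {
  return !WRITE_PATTERN.test(query);
}
```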

Property 16: Postgres search independence

For any search query executed against the same dataset, the BM25+Vector retrieval results (excluding the graph retriever contribution) SHALL be identical whether Neo4j is enabled or disabled — Neo4j is purely additive and never modifies Postgres search behavior.

Validates: Requirement 13.1


Error Handling

Graceful Degradation Cascade

Every Neo4j feature follows a strict degradation hierarchy. No optional component failure should ever cause a user-visible error or block document ingestion.

Component Health Check Order (at pipeline start):
  1. Sidecar (/health)     → gates Steps F, F2
  2. Neo4j (RETURN 1)      → gates Steps G, G2, G3, G4
  3. Extraction LLM (/health or first call) → gates Step F2
  4. GDS plugin (CALL gds.graph.list()) → gates Step G4

Failure Handling:
┌──────────────────────────┬────────────────────────────────────────────┐
│ Component Down           │ Behavior                                  │
├──────────────────────────┼────────────────────────────────────────────┤
│ NEO4J_URI not set        │ Entire graph pipeline disabled            │
│                          │ No entity extraction, no relationships    │
│                          │ No graph data stored anywhere             │
│                          │ Pipeline ends at Step E (Postgres only)   │
├──────────────────────────┼────────────────────────────────────────────┤
│ Sidecar unreachable      │ Skip entity extraction (Step F)           │
│ (NEO4J_URI is set)       │ No graph data for this document           │
│                          │ Document still ingested (Steps A-E ok)    │
├──────────────────────────┼────────────────────────────────────────────┤
│ Neo4j unreachable        │ Skip Neo4j writes (Step G)                │
│ (NEO4J_URI is set)       │ Entity extraction still runs but data     │
│                          │ is discarded (nowhere to write it)        │
│                          │ Ensemble search uses 2-retriever mode     │
├──────────────────────────┼────────────────────────────────────────────┤
│ Extraction LLM down      │ Skip relationship extraction (Step F2)    │
│                          │ Graph has BERT entities + CO_OCCURS only  │
├──────────────────────────┼────────────────────────────────────────────┤
│ GDS plugin unavailable   │ Skip community detection (Step G4)        │
│                          │ All other graph operations work normally  │
├──────────────────────────┼────────────────────────────────────────────┤
│ Vector index missing     │ Skip entity resolution (Step G3)          │
│                          │ Entities may have duplicates              │
├──────────────────────────┼────────────────────────────────────────────┤
│ Neo4j down at query time │ Graph retriever returns empty []          │
│                          │ Ensemble falls back to BM25+Vector        │
│                          │ No user-visible error                     │
├──────────────────────────┼────────────────────────────────────────────┤
│ External embedding API   │ Fall back to BERT via Sidecar             │
│ unreachable              │ Log warning, continue with 768-dim        │
└──────────────────────────┴────────────────────────────────────────────┘
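
The gating rules from the health check order can be sketched as a pure function from health state to the list of graph steps that are allowed to run. The `PipelineHealth` shape and the function name are hypothetical. The sketch encodes only the gates listed above; a step that is not gated off may still have nothing to do (for example, Step G when the Sidecar is down and no entities were extracted).

```typescript
type GraphStep = "F" | "F2" | "G" | "G2" | "G3" | "G4";

// Hypothetical snapshot of the checks performed at pipeline start.
interface PipelineHealth {
  neo4jUriSet: boolean;
  sidecarUp: boolean;
  neo4jUp: boolean;
  extractionLlmUp: boolean;
  gdsAvailable: boolean;
  vectorIndexPresent: boolean;
}

function enabledGraphSteps(h: PipelineHealth): GraphStep[] {
  // NEO4J_URI not set: the pipeline ends at Step E.
  if (!h.neo4jUriSet) return [];
  const steps: GraphStep[] = [];
  if (h.sidecarUp) {
    steps.push("F");
    // F2 needs both the Sidecar and the extraction LLM.
    if (h.extractionLlmUp) steps.push("F2");
  }
  if (h.neo4jUp) {
    steps.push("G", "G2");
    // Entity resolution needs the vector index.
    if (h.vectorIndexPresent) steps.push("G3");
    // Community detection needs the GDS plugin.
    if (h.gdsAvailable) steps.push("G4");
  }
  return steps;
}
```

Keeping the decision in one pure function makes each row of the table straightforward to cover with a unit test.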

Error Logging Convention

All Neo4j-related errors use structured log prefixes for easy filtering:

[Neo4jWriter]     — Direct write operations
[Neo4jRetriever]  — Query-time graph retrieval
[EntityResolution] — Duplicate entity merging
[ExtractionLLM]   — Relationship extraction via LLM
[EmbeddingProvider] — Embedding generation
[CommunityDetect] — Leiden clustering and summaries
[Text2Cypher]     — Natural language to Cypher translation
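
One way to keep the prefixes consistent is to type them, so a misspelled prefix fails at compile time. This is a sketch; `formatGraphLog` and the constant name are hypothetical.

```typescript
// The seven prefixes listed above.
const GRAPH_LOG_PREFIXES = [
  "Neo4jWriter",
  "Neo4jRetriever",
  "EntityResolution",
  "ExtractionLLM",
  "EmbeddingProvider",
  "CommunityDetect",
  "Text2Cypher",
] as const;

type GraphLogPrefix = (typeof GRAPH_LOG_PREFIXES)[number];

// Produces a line such as "[Neo4jWriter] connection refused".
function formatGraphLog(prefix: GraphLogPrefix, message: string): string {
  return `[${prefix}] ${message}`;
}
```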

Testing Strategy

Dual Testing Approach

  • Unit tests: Specific examples, edge cases, error conditions, graceful degradation paths
  • Property tests: Universal properties across all inputs (fast-check, minimum 100 iterations)
  • Contract tests: Zod schema validation — runnable today with no infrastructure
  • Integration tests: Cross-component data flow with mocked boundaries

Parent Contract Test Strategy

These test files validate parent-level contracts. They are runnable immediately with no Neo4j, no Sidecar, no LLM — just Zod schema validation.

__tests__/api/graphrag/contracts/
├── graphrag-schemas.ts                    # Shared Zod schemas (source of truth)
├── contract.extract-entities.test.ts      # BERT entity response shape validation
├── contract.extract-relationships.test.ts # LLM extraction response shape validation
├── contract.neo4j-write-result.test.ts    # Neo4j write result shape validation
├── contract.neo4j-node-shapes.test.ts     # Entity, Section, Document, Topic, Community node shapes
├── contract.embedding-provider.test.ts    # Embedding result shape validation
├── contract.env-vars.test.ts              # Environment variable schema validation
└── contract.ensemble-search.test.ts       # Graph retriever result shape validation

| Test File | What It Validates | Runs Against |
|---|---|---|
| contract.extract-entities.test.ts | BERT entity response (base + enhanced with embeddings) | ExtractEntitiesResponseSchema, ExtractEntitiesEnhancedResponseSchema |
| contract.extract-relationships.test.ts | LLM extraction response, entity/relationship shapes, SCREAMING_SNAKE_CASE, dropped_relationships | ExtractRelationshipsResponseSchema, ExtractionEntitySchema, ExtractionRelationshipSchema |
| contract.neo4j-write-result.test.ts | Neo4j write result shape, dynamicRelTypes array | Neo4jWriteResultSchema |
| contract.neo4j-node-shapes.test.ts | All Neo4j node types: Entity, Section, Document, Topic, Community | Neo4jEntityNodeSchema, Neo4jSectionNodeSchema, Neo4jDocumentNodeSchema, Neo4jTopicNodeSchema, Neo4jCommunityNodeSchema |
| contract.embedding-provider.test.ts | Embedding result shape, batch result shape, dimensionality | EmbeddingResultSchema, EmbeddingBatchResultSchema |
| contract.env-vars.test.ts | All new env vars parse correctly, optional vars accept undefined, backward compat with GEMMA_* vars | neo4jGraphRAGEnvSchema |
| contract.ensemble-search.test.ts | Graph retriever result shape, dynamic weight computation | GraphRetrieverResultSchema |

Agent hook: On sub-feature spec completion, the agent runs pnpm test -- __tests__/api/graphrag/contracts/ to validate all parent contracts still pass.

Property-Based Test Configuration

  • Library: fast-check (already in devDependencies)
  • Minimum 100 iterations per property
  • Tag format: Feature: neo4j-graphrag, Property {N}: {title}
  • Each correctness property maps to exactly one *.pbt.test.ts file
  • Property tests are created in sub-feature specs, not in this parent spec

Cross-Cutting Concern Tests

Postgres Independence (Req 13):

  • Each sub-feature spec MUST include a test proving that Postgres search results are identical with Neo4j on vs off
  • Test pattern: run the same query twice (once with NEO4J_URI set, once without), compare BM25+Vector results

Graceful Degradation (Req 14):

  • Each sub-feature spec MUST include tests for every failure mode in the degradation cascade
  • Test pattern: mock the failing component, verify the pipeline continues without errors

Model-Agnostic Behavior (Reqs 2, 4):

  • Sub-feature specs for Reqs 2 and 4 MUST include tests proving no hardcoded model names or provider URLs
  • Test pattern: configure different EXTRACTION_LLM_BASE_URL values, verify the same code path is used
