Feature Packet — Sidecar GraphRAG: BERT CLS Embeddings + Relationship Extraction
Parent Spec: .kiro/specs/neo4j-graphrag/
Workstream Owner: Dev B
Scope: R3 (BERT CLS Embeddings) + R4 (Relationship Extraction Endpoint)
Integration Surface: Pure Python sidecar — no Next.js or Neo4j dependencies
1. Feature Overview
The FastAPI sidecar (sidecar/) currently extracts named entities from text via BERT NER. This workstream adds two capabilities that the ingestion pipeline needs for the full GraphRAG system:
- **CLS Embeddings (R3):** The existing `/extract-entities` endpoint gains an optional query parameter that makes it return a 768-dimensional BERT embedding per entity. This lets the downstream Neo4j write pipeline store entity vectors for semantic resolution — without breaking any existing callers.
- **Relationship Extraction (R4):** A new `/extract-relationships` endpoint that uses any OpenAI-compatible LLM to extract typed semantic relationships (e.g., `CEO_OF`, `ACQUIRED`) from text chunks. The endpoint is fully model-agnostic — it reads an LLM base URL and model name from env vars and calls the standard chat completions API. When no LLM is configured, it degrades gracefully to empty results.
Both capabilities are self-contained in the sidecar. Dev A's Next.js ingestion pipeline will call these endpoints, but that's a separate workstream — your job is to make the sidecar produce correct responses that match the shared contracts.
2. Shared Contracts (Reference — Do Not Duplicate)
All shared schemas live in __tests__/api/graphrag/contracts/graphrag-schemas.ts (117 passing tests). Your Python responses must produce JSON that passes Zod validation against these schemas.
R3 schemas you must match:
- `EntityBaseSchema` — `{ text: string, label: string, score: number(0..1) }`
- `EntityWithEmbeddingSchema` — extends EntityBase with `embedding: number[768]`
- `ExtractEntitiesResponseSchema` — base response (no embeddings)
- `ExtractEntitiesEnhancedResponseSchema` — response with embeddings
R4 schemas you must match:
- `ExtractionEntitySchema` — `{ name: string, type: "PERSON"|"ORGANIZATION"|"LOCATION"|"PRODUCT"|"EVENT"|"OTHER" }`
- `ExtractionRelationshipSchema` — `{ source: string, target: string, type: string(SCREAMING_SNAKE_CASE), detail: string }`
- `ExtractionChunkResultSchema` — `{ text, entities, relationships, dropped_relationships }`
- `ExtractRelationshipsResponseSchema` — `{ results, total_entities, total_relationships, total_dropped }`
Read the actual Zod definitions before you start. The contract tests are your ground truth.
3. Workstream Assignment — Dev B
Contract (R3 — BERT CLS Embeddings)
Enhance POST /extract-entities to optionally return BERT embeddings:
- Accept a `?include_embeddings=true` query parameter
- When the param is set, each entity in the response includes an `embedding` field: a list of exactly 768 floats
- When the param is absent (or false), return the existing response shape — no `embedding` field at all. Existing callers must not break.
- Zero entities in a chunk → empty `entities` array for that chunk, no error
- Empty `chunks` input → `{"results": [], "total_entities": 0}`
- `/health` continues to return HTTP 200 regardless
Contract (R4 — Relationship Extraction)
Add a new POST /extract-relationships endpoint:
- Request body: `{"chunks": ["text chunk 1", "text chunk 2", ...]}`
- For each chunk, call an OpenAI-compatible LLM to extract entities and typed relationships
- The LLM endpoint is configured via two env vars:
  - `EXTRACTION_LLM_BASE_URL` — base URL (e.g., `http://localhost:11434/v1`)
  - `EXTRACTION_LLM_MODEL` — model name (e.g., `gemma3:12b`)
- Call `POST {EXTRACTION_LLM_BASE_URL}/chat/completions` with the standard OpenAI request shape
- Use temperature `0.1` and `max_tokens` of at least `4096`
- The system prompt must instruct the LLM to output structured JSON containing entities and relationships
- Relationship `type` values must be SCREAMING_SNAKE_CASE matching `^[A-Z][A-Z0-9_]*$`
- For each chunk, validate that every relationship's `source` and `target` exist in that chunk's extracted entity list. Relationships that reference non-existent entities go into `dropped_relationships`.
- Strip `<think>...</think>` tags from the LLM response before attempting JSON parse
- When `EXTRACTION_LLM_BASE_URL` is unset → return HTTP 200 with `{"results": [], "total_entities": 0, "total_relationships": 0, "total_dropped": 0}`
- When the LLM endpoint is unreachable or returns an error → same graceful empty response (HTTP 200, not 500)
- `/health` returns HTTP 200 even when the LLM is unavailable
- No hardcoded model names, provider URLs, or provider-specific logic anywhere in the source
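The source/target validation step can be sketched as a pure partition function, applied per chunk after parsing the LLM output (the function name is illustrative, not existing sidecar code):

```python
def partition_relationships(entities: list[dict],
                            relationships: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split LLM-proposed relationships into (kept, dropped): a relationship
    is kept only when both its source and target name an entity that was
    actually extracted from this chunk."""
    names = {e["name"] for e in entities}
    kept, dropped = [], []
    for rel in relationships:
        bucket = kept if rel["source"] in names and rel["target"] in names else dropped
        bucket.append(rel)
    return kept, dropped
```

Keeping this as a pure function (no FastAPI, no LLM client) makes the dropped-relationship behavior trivially unit-testable.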
Constraints
- The NER model is already loaded at startup as `app.state.entity_extractor` (an `EntityExtractor` instance wrapping `dslim/bert-base-NER`). Use it — don't load a second model.
- The NER pipeline uses `aggregation_strategy="simple"`, which means the pipeline output gives you merged entity spans (e.g., "Satya Nadella" as one entity, not "Sat", "##ya", "Na", "##della"). But the pipeline's output dict doesn't include hidden states. To get CLS embeddings, you'll need to interact with the underlying model directly — the pipeline object exposes `.model` and `.tokenizer`.
- The BERT model (`dslim/bert-base-NER`) has a hidden size of 768. The CLS token is at position 0 of the last hidden state tensor.
- The sidecar uses the `transformers` library. You have access to `torch` for tensor operations.
- For R4, the OpenAI chat completions contract is well-documented. You're calling it as a client, not implementing it. The request shape is `{"model": "...", "messages": [...], "temperature": 0.1, "max_tokens": 4096}`. The response has `choices[0].message.content`.
- Python tests go in `sidecar/tests/` using pytest. Run with: `PYTHONPATH=sidecar pytest sidecar/tests/ -v`
- The sidecar's `main.py` registers routers via `app.include_router()`. Follow the existing pattern for adding new routes.
Acceptance Criteria (R3)
- `POST /extract-entities` with body `{"chunks": ["Microsoft CEO Satya Nadella"]}` returns a response that passes `ExtractEntitiesResponseSchema.safeParse()` — no `embedding` field present.
- `POST /extract-entities?include_embeddings=true` with the same body returns a response that passes `ExtractEntitiesEnhancedResponseSchema.safeParse()` — each entity has `embedding: number[768]`.
- Each embedding is a list of exactly 768 floating-point numbers.
- `POST /extract-entities` with body `{"chunks": []}` or chunks that produce zero entities → `{"results": [], "total_entities": 0}` or results with empty entity arrays. No error.
- `GET /health` returns HTTP 200.
Acceptance Criteria (R4)
- `POST /extract-relationships` with body `{"chunks": ["Satya Nadella is the CEO of Microsoft"]}` returns a response that passes `ExtractRelationshipsResponseSchema.safeParse()`.
- Every `type` field in `relationships` and `dropped_relationships` matches the regex `^[A-Z][A-Z0-9_]*$`.
- Any relationship whose `source` or `target` does not appear in that chunk's `entities` list (by `name`) is present in `dropped_relationships`, not `relationships`.
- When `EXTRACTION_LLM_BASE_URL` is not set → HTTP 200 with `{"results": [], "total_entities": 0, "total_relationships": 0, "total_dropped": 0}`.
- When the LLM response contains `<think>reasoning here</think>{"entities": [...], "relationships": [...]}`, the think tags are stripped and the JSON parses correctly.
- `GET /health` returns HTTP 200 even when the LLM endpoint is down or unconfigured.
- No model name or provider URL is hardcoded in the source code.
- `grep -r "gemma\|ollama\|openai\.com\|localhost:11434" sidecar/app/` returns zero matches (test names and comments excluded).
Design Challenges
These are real decisions you need to make. There isn't one right answer — each approach has trade-offs. Pick one, implement it, and be ready to explain why.
Challenge 1: CLS Embedding Strategy
The NER pipeline's aggregation_strategy="simple" merges sub-word tokens into entity spans and gives you entity text + label + score. But it doesn't give you embeddings. You need to produce a single 768-dim vector per entity.
You have at least three options:
- (a) CLS token of the full input — run the text through the model and take the hidden state at position 0 (the `[CLS]` token). Every entity in that chunk gets the same embedding. Simple, but entities in the same chunk are indistinguishable by embedding.
- (b) Entity start-token hidden state — tokenize the text, find the token position where each entity starts, and extract that token's hidden state. Each entity gets a unique vector, but multi-token entities are represented by only their first token.
- (c) Mean-pool across entity span — tokenize, find all token positions that belong to each entity span, and average their hidden states. Most semantically rich, but you need to align the tokenizer's sub-word offsets with the NER pipeline's character-level entity spans.
Think about: What is this embedding used for downstream? (Entity resolution — matching "Microsoft" in one document to "Microsoft Corp" in another.) Which approach gives the most useful signal for that task? What's the complexity cost of each?
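For option (c), the hard part is the character-to-token alignment. Assuming a fast tokenizer called with `return_offsets_mapping=True` (which yields per-token `(char_start, char_end)` pairs), the overlap test reduces to a small pure function (name illustrative); the mean-pool itself is then `last_hidden_state[0, indices].mean(dim=0)`:

```python
def tokens_in_span(offset_mapping: list[tuple[int, int]],
                   start: int, end: int) -> list[int]:
    """Return indices of tokens whose character offsets overlap the entity's
    character span [start, end). Fast tokenizers map special tokens
    ([CLS], [SEP]) to (0, 0), so zero-width offsets are skipped."""
    indices = []
    for i, (tok_start, tok_end) in enumerate(offset_mapping):
        if tok_start == tok_end:  # special token — skip
            continue
        if tok_start < end and tok_end > start:  # any overlap with the span
            indices.append(i)
    return indices
```

Using overlap rather than exact boundary equality keeps the alignment robust when the NER pipeline's character spans don't land exactly on sub-word boundaries.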
Challenge 2: Robust JSON Parsing from LLM Output
LLMs are unreliable JSON producers. The response from /chat/completions might contain:
- Clean JSON (best case)
- JSON wrapped in markdown code fences (```json ... ```)
- `<think>...</think>` tags followed by JSON (thinking models like Gemma)
- Partial JSON (truncated if `max_tokens` is hit)
- Completely malformed text
- An empty string
You need a parsing strategy that handles all of these gracefully. Consider:
- A sequential fallback chain: try direct parse → strip think tags and retry → extract from code fences → regex for JSON object → give up
- How many fallback layers are worth the complexity?
- What does "give up" look like? (Remember: the contract says HTTP 200 with empty results, not 500.)
4. Testing Strategy
4A. Contract Compliance Tests (Already Committed — Do Not Modify)
These TypeScript tests validate that your Python endpoint responses match the shared Zod schemas. They already exist and pass:
| Test File | What It Validates |
| --- | --- |
| `__tests__/api/graphrag/contract.extract-entities.test.ts` | `EntityBaseSchema`, `ExtractEntitiesResponseSchema`, `ExtractEntitiesEnhancedResponseSchema` — base and enhanced response shapes |
| `__tests__/api/graphrag/contract.extract-relationships.test.ts` | `ExtractionEntitySchema`, `ExtractionRelationshipSchema`, `ExtractRelationshipsResponseSchema` — relationship types, entity types, dropped_relationships |
| `__tests__/api/graphrag/contract.bert-embeddings.test.ts` | `EntityWithEmbeddingSchema` — 768-dim embedding validation, dimension rejection, score range |
Your Python implementation must produce JSON that would pass these Zod validations. You don't run these tests directly — they validate the schema definitions. Your job is to produce responses that conform.
4B. Integration Boundary Tests (Skeletons — You Implement)
These test the actual sidecar endpoints. Create these files and fill in the test logic:
`sidecar/tests/test_entities_embeddings.py`

```python
"""
Integration tests for POST /extract-entities with embedding support.
Tests the boundary between the FastAPI endpoint and the BERT model.
"""
import pytest
from fastapi.testclient import TestClient

# You'll need to set up the test client with the app.
# Hint: look at how main.py sets up lifespan and app.state.


class TestExtractEntitiesBase:
    """Tests for the base response (no embeddings)."""

    def test_returns_entities_without_embedding_field(self):
        """POST /extract-entities with a known-entity chunk returns entities
        with text, label, score — and NO embedding key."""
        ...

    def test_empty_chunks_returns_empty_results(self):
        """POST /extract-entities with empty chunks list returns
        {"results": [], "total_entities": 0}."""
        ...


class TestExtractEntitiesWithEmbeddings:
    """Tests for the enhanced response (include_embeddings=true)."""

    def test_returns_768_dim_embedding_per_entity(self):
        """POST /extract-entities?include_embeddings=true returns each entity
        with an embedding list of exactly 768 floats."""
        ...

    def test_embedding_values_are_floats(self):
        """Each value in the embedding array is a Python float, not an int or string."""
        ...

    def test_backward_compatible_without_param(self):
        """POST /extract-entities (no query param) returns the base shape —
        entities do NOT have an embedding field."""
        ...
```
`sidecar/tests/test_relationships.py`

```python
"""
Integration tests for POST /extract-relationships.
Tests the boundary between the FastAPI endpoint and the LLM client.
"""
import pytest
from fastapi.testclient import TestClient


class TestExtractRelationships:
    """Tests for the relationship extraction endpoint."""

    def test_relationship_types_are_screaming_snake_case(self):
        """Every relationship type in the response matches ^[A-Z][A-Z0-9_]*$."""
        ...

    def test_invalid_source_target_goes_to_dropped(self):
        """Relationships referencing entities not in the chunk's entity list
        appear in dropped_relationships, not relationships."""
        ...

    def test_graceful_empty_when_llm_unconfigured(self):
        """When EXTRACTION_LLM_BASE_URL is unset, returns HTTP 200 with
        {"results": [], "total_entities": 0, "total_relationships": 0, "total_dropped": 0}."""
        ...

    def test_think_tag_stripping(self):
        """When LLM response contains <think>...</think> before JSON,
        the tags are stripped and JSON parses correctly."""
        ...

    def test_health_returns_200_without_llm(self):
        """GET /health returns 200 even when LLM endpoint is unavailable."""
        ...
```
4C. Dev-Proposed Edge Case Tests
Propose 2-3 additional test cases beyond the acceptance criteria. Think about:
- Long text chunks: What happens when a chunk exceeds 2048 characters? The existing NER model truncates at 2048 — does your embedding extraction handle the same truncation? Does the LLM endpoint handle very long prompts?
- Empty LLM response: What if the LLM returns an empty string or `{"choices": [{"message": {"content": ""}}]}`? Does your fallback chain handle this without crashing?
- Unicode / non-English text: What happens when the input is `"東京はMicrosoftの拠点です"` — does the BERT tokenizer handle it? Does the LLM? Do entity spans still align?
Add your proposed tests to your Design Brief when you submit your PR.
Quick Reference
| Item | Location |
| --- | --- |
| Sidecar app entry | `sidecar/app/main.py` |
| NER model class | `sidecar/app/models/ner.py` → `EntityExtractor` |
| Embedder class | `sidecar/app/models/embedder.py` → `Embedder` |
| Existing entities route | `sidecar/app/routes/entities.py` |
| Existing embed route | `sidecar/app/routes/embed.py` |
| Shared Zod schemas | `__tests__/api/graphrag/contracts/graphrag-schemas.ts` |
| Contract tests (R3) | `__tests__/api/graphrag/contract.extract-entities.test.ts`, `contract.bert-embeddings.test.ts` |
| Contract tests (R4) | `__tests__/api/graphrag/contract.extract-relationships.test.ts` |
| Python test dir | `sidecar/tests/` |
| Run Python tests | `PYTHONPATH=sidecar pytest sidecar/tests/ -v` |
| Env vars (R4) | `EXTRACTION_LLM_BASE_URL`, `EXTRACTION_LLM_MODEL` |