Feature Packet — Sidecar GraphRAG: BERT CLS Embeddings + Relationship Extraction
Parent Spec: .kiro/specs/neo4j-graphrag/
Workstream Owner: Dev B
Scope: R3 (BERT CLS Embeddings) + R4 (Relationship Extraction Endpoint)
Integration Surface: Pure Python sidecar — no Next.js or Neo4j dependencies
1. Feature Overview
The FastAPI sidecar (sidecar/) currently extracts named entities from text via BERT NER. This workstream adds two capabilities that the ingestion pipeline needs for the full GraphRAG system:
- **CLS Embeddings (R3):** The existing `/extract-entities` endpoint gains an optional query parameter that makes it return a 768-dimensional BERT embedding per entity. This lets the downstream Neo4j write pipeline store entity vectors for semantic resolution — without breaking any existing callers.
- **Relationship Extraction (R4):** A new `/extract-relationships` endpoint that uses any OpenAI-compatible LLM to extract typed semantic relationships (e.g., `CEO_OF`, `ACQUIRED`) from text chunks. The endpoint is fully model-agnostic — it reads an LLM base URL and model name from env vars and calls the standard chat completions API. When no LLM is configured, it degrades gracefully to empty results.
Both capabilities are self-contained in the sidecar. Dev A's Next.js ingestion pipeline will call these endpoints, but that's a separate workstream — your job is to make the sidecar produce correct responses that match the shared contracts.
2. Shared Contracts (Reference — Do Not Duplicate)
All shared schemas live in __tests__/api/graphrag/contracts/graphrag-schemas.ts (117 passing tests). Your Python responses must produce JSON that passes Zod validation against these schemas.
R3 schemas you must match:
- `EntityBaseSchema` — `{ text: string, label: string, score: number(0..1) }`
- `EntityWithEmbeddingSchema` — extends EntityBase with `embedding: number[768]`
- `ExtractEntitiesResponseSchema` — base response (no embeddings)
- `ExtractEntitiesEnhancedResponseSchema` — response with embeddings
R4 schemas you must match:
- `ExtractionEntitySchema` — `{ name: string, type: "PERSON"|"ORGANIZATION"|"LOCATION"|"PRODUCT"|"EVENT"|"OTHER" }`
- `ExtractionRelationshipSchema` — `{ source: string, target: string, type: string(SCREAMING_SNAKE_CASE), detail: string }`
- `ExtractionChunkResultSchema` — `{ text, entities, relationships, dropped_relationships }`
- `ExtractRelationshipsResponseSchema` — `{ results, total_entities, total_relationships, total_dropped }`
Read the actual Zod definitions before you start. The contract tests are your ground truth.
3. Workstream Assignment — Dev B
Contract (R3 — BERT CLS Embeddings)
Enhance POST /extract-entities to optionally return BERT embeddings:
- Accept a `?include_embeddings=true` query parameter
- When the param is set, each entity in the response includes an `embedding` field: a list of exactly 768 floats
- When the param is absent (or false), return the existing response shape — no `embedding` field at all. Existing callers must not break.
- Zero entities in a chunk → empty `entities` array for that chunk, no error
- Empty `chunks` input → `{"results": [], "total_entities": 0}`
- `/health` continues to return HTTP 200 regardless
Contract (R4 — Relationship Extraction)
Add a new POST /extract-relationships endpoint:
- Request body: `{"chunks": ["text chunk 1", "text chunk 2", ...]}`
- For each chunk, call an OpenAI-compatible LLM to extract entities and typed relationships
- The LLM endpoint is configured via two env vars:
  - `EXTRACTION_LLM_BASE_URL` — base URL (e.g., `http://localhost:11434/v1`)
  - `EXTRACTION_LLM_MODEL` — model name (e.g., `gemma3:12b`)
- Call `POST {EXTRACTION_LLM_BASE_URL}/chat/completions` with the standard OpenAI request shape
- Use temperature `0.1` and `max_tokens` of at least `4096`
- The system prompt must instruct the LLM to output structured JSON containing entities and relationships
- Relationship `type` values must be SCREAMING_SNAKE_CASE matching `^[A-Z][A-Z0-9_]*$`
- For each chunk, validate that every relationship's `source` and `target` exist in that chunk's extracted entity list. Relationships that reference non-existent entities go into `dropped_relationships`.
- Strip `<think>...</think>` tags from the LLM response before attempting JSON parse
- When `EXTRACTION_LLM_BASE_URL` is unset → return HTTP 200 with `{"results": [], "total_entities": 0, "total_relationships": 0, "total_dropped": 0}`
- When the LLM endpoint is unreachable or returns an error → same graceful empty response (HTTP 200, not 500)
- `/health` returns HTTP 200 even when the LLM is unavailable
- No hardcoded model names, provider URLs, or provider-specific logic anywhere in the source
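The source/target validation step can be sketched as a pure partition function, applied per chunk after parsing the LLM output (the function name is illustrative, not existing sidecar code):

```python
def partition_relationships(entities: list[dict],
                            relationships: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split LLM-proposed relationships into (kept, dropped): a relationship
    is kept only when both its source and target name an entity that was
    actually extracted from this chunk."""
    names = {e["name"] for e in entities}
    kept, dropped = [], []
    for rel in relationships:
        bucket = kept if rel["source"] in names and rel["target"] in names else dropped
        bucket.append(rel)
    return kept, dropped
```

Keeping this as a pure function (no FastAPI, no LLM client) makes the dropped-relationship behavior trivially unit-testable.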
Constraints
- The NER model is already loaded at startup as `app.state.entity_extractor` (an `EntityExtractor` instance wrapping `dslim/bert-base-NER`). Use it — don't load a second model.
- The NER pipeline uses `aggregation_strategy="simple"`, which means the pipeline output gives you merged entity spans (e.g., "Satya Nadella" as one entity, not "Sat", "##ya", "Na", "##della"). But the pipeline's output dict doesn't include hidden states. To get CLS embeddings, you'll need to interact with the underlying model directly — the pipeline object exposes `.model` and `.tokenizer`.
- The BERT model (`dslim/bert-base-NER`) has a hidden size of 768. The CLS token is at position 0 of the last hidden state tensor.
- The sidecar uses the `transformers` library. You have access to `torch` for tensor operations.
- For R4, the OpenAI chat completions contract is well-documented. You're calling it as a client, not implementing it. The request shape is `{"model": "...", "messages": [...], "temperature": 0.1, "max_tokens": 4096}`. The response has `choices[0].message.content`.
- Python tests go in `sidecar/tests/` using pytest. Run with: `PYTHONPATH=sidecar pytest sidecar/tests/ -v`
- The sidecar's `main.py` registers routers via `app.include_router()`. Follow the existing pattern for adding new routes.
Acceptance Criteria (R3)
- `POST /extract-entities` with body `{"chunks": ["Microsoft CEO Satya Nadella"]}` returns a response that passes `ExtractEntitiesResponseSchema.safeParse()` — no `embedding` field present.
- `POST /extract-entities?include_embeddings=true` with the same body returns a response that passes `ExtractEntitiesEnhancedResponseSchema.safeParse()` — each entity has `embedding: number[768]`.
- Each embedding is a list of exactly 768 floating-point numbers.
- `POST /extract-entities` with body `{"chunks": []}` or chunks that produce zero entities → `{"results": [], "total_entities": 0}` or results with empty entity arrays. No error.
- `GET /health` returns HTTP 200.
Acceptance Criteria (R4)
- `POST /extract-relationships` with body `{"chunks": ["Satya Nadella is the CEO of Microsoft"]}` returns a response that passes `ExtractRelationshipsResponseSchema.safeParse()`.
- Every `type` field in `relationships` and `dropped_relationships` matches the regex `^[A-Z][A-Z0-9_]*$`.
- Any relationship whose `source` or `target` does not appear in that chunk's `entities` list (by `name`) is present in `dropped_relationships`, not `relationships`.
- When `EXTRACTION_LLM_BASE_URL` is not set → HTTP 200 with `{"results": [], "total_entities": 0, "total_relationships": 0, "total_dropped": 0}`.
- When the LLM response contains `<think>reasoning here</think>{"entities": [...], "relationships": [...]}`, the think tags are stripped and the JSON parses correctly.
- `GET /health` returns HTTP 200 even when the LLM endpoint is down or unconfigured.
- No model name or provider URL is hardcoded in the source code.
- `grep -r "gemma\|ollama\|openai\.com\|localhost:11434" sidecar/app/` returns zero matches (test names and comments excluded).
Design Challenges
These are real decisions you need to make. There isn't one right answer — each approach has trade-offs. Pick one, implement it, and be ready to explain why.
Challenge 1: CLS Embedding Strategy
The NER pipeline's aggregation_strategy="simple" merges sub-word tokens into entity spans and gives you entity text + label + score. But it doesn't give you embeddings. You need to produce a single 768-dim vector per entity.
You have at least three options:
- (a) CLS token of the full input — run the text through the model and take the hidden state at position 0 (the `[CLS]` token). Every entity in that chunk gets the same embedding. Simple, but entities in the same chunk are indistinguishable by embedding.
- (b) Entity start-token hidden state — tokenize the text, find the token position where each entity starts, and extract that token's hidden state. Each entity gets a unique vector, but multi-token entities are represented by only their first token.
- (c) Mean-pool across entity span — tokenize, find all token positions that belong to each entity span, and average their hidden states. Most semantically rich, but you need to align the tokenizer's sub-word offsets with the NER pipeline's character-level entity spans.
Think about: What is this embedding used for downstream? (Entity resolution — matching "Microsoft" in one document to "Microsoft Corp" in another.) Which approach gives the most useful signal for that task? What's the complexity cost of each?
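For option (c), the hard part is the character-to-token alignment. Assuming a fast tokenizer called with `return_offsets_mapping=True` (which yields per-token `(char_start, char_end)` pairs), the overlap test reduces to a small pure function (name illustrative); the mean-pool itself is then `last_hidden_state[0, indices].mean(dim=0)`:

```python
def tokens_in_span(offset_mapping: list[tuple[int, int]],
                   start: int, end: int) -> list[int]:
    """Return indices of tokens whose character offsets overlap the entity's
    character span [start, end). Fast tokenizers map special tokens
    ([CLS], [SEP]) to (0, 0), so zero-width offsets are skipped."""
    indices = []
    for i, (tok_start, tok_end) in enumerate(offset_mapping):
        if tok_start == tok_end:  # special token — skip
            continue
        if tok_start < end and tok_end > start:  # any overlap with the span
            indices.append(i)
    return indices
```

Using overlap rather than exact boundary equality keeps the alignment robust when the NER pipeline's character spans don't land exactly on sub-word boundaries.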
Challenge 2: Robust JSON Parsing from LLM Output
LLMs are unreliable JSON producers. The response from /chat/completions might contain:
- Clean JSON (best case)
- JSON wrapped in markdown code fences (```json ... ```)
- `<think>...</think>` tags followed by JSON (thinking models like Gemma)
- Partial JSON (truncated if `max_tokens` is hit)
- Completely malformed text
- An empty string
You need a parsing strategy that handles all of these gracefully. Consider:
- A sequential fallback chain: try direct parse → strip think tags and retry → extract from code fences → regex for JSON object → give up
- How many fallback layers are worth the complexity?
- What does "give up" look like? (Remember: the contract says HTTP 200 with empty results, not 500.)
4. Testing Strategy
4A. Contract Compliance Tests (Already Committed — Do Not Modify)
These TypeScript tests validate that your Python endpoint responses match the shared Zod schemas. They already exist and pass:
| Test File | What It Validates |
| --- | --- |
| `__tests__/api/graphrag/contract.extract-entities.test.ts` | `EntityBaseSchema`, `ExtractEntitiesResponseSchema`, `ExtractEntitiesEnhancedResponseSchema` — base and enhanced response shapes |
| `__tests__/api/graphrag/contract.extract-relationships.test.ts` | `ExtractionEntitySchema`, `ExtractionRelationshipSchema`, `ExtractRelationshipsResponseSchema` — relationship types, entity types, dropped_relationships |
| `__tests__/api/graphrag/contract.bert-embeddings.test.ts` | `EntityWithEmbeddingSchema` — 768-dim embedding validation, dimension rejection, score range |
Your Python implementation must produce JSON that would pass these Zod validations. You don't run these tests directly — they validate the schema definitions. Your job is to produce responses that conform.
4B. Integration Boundary Tests (Skeletons — You Implement)
These test the actual sidecar endpoints. Create these files and fill in the test logic:
`sidecar/tests/test_entities_embeddings.py`

```python
"""
Integration tests for POST /extract-entities with embedding support.
Tests the boundary between the FastAPI endpoint and the BERT model.
"""
import pytest
from fastapi.testclient import TestClient

# You'll need to set up the test client with the app.
# Hint: look at how main.py sets up lifespan and app.state.


class TestExtractEntitiesBase:
    """Tests for the base response (no embeddings)."""

    def test_returns_entities_without_embedding_field(self):
        """POST /extract-entities with a known-entity chunk returns entities
        with text, label, score — and NO embedding key."""
        ...

    def test_empty_chunks_returns_empty_results(self):
        """POST /extract-entities with empty chunks list returns
        {"results": [], "total_entities": 0}."""
        ...


class TestExtractEntitiesWithEmbeddings:
    """Tests for the enhanced response (include_embeddings=true)."""

    def test_returns_768_dim_embedding_per_entity(self):
        """POST /extract-entities?include_embeddings=true returns each entity
        with an embedding list of exactly 768 floats."""
        ...

    def test_embedding_values_are_floats(self):
        """Each value in the embedding array is a Python float, not an int or string."""
        ...

    def test_backward_compatible_without_param(self):
        """POST /extract-entities (no query param) returns the base shape —
        entities do NOT have an embedding field."""
        ...
```
`sidecar/tests/test_relationships.py`

```python
"""
Integration tests for POST /extract-relationships.
Tests the boundary between the FastAPI endpoint and the LLM client.
"""
import pytest
from fastapi.testclient import TestClient


class TestExtractRelationships:
    """Tests for the relationship extraction endpoint."""

    def test_relationship_types_are_screaming_snake_case(self):
        """Every relationship type in the response matches ^[A-Z][A-Z0-9_]*$."""
        ...

    def test_invalid_source_target_goes_to_dropped(self):
        """Relationships referencing entities not in the chunk's entity list
        appear in dropped_relationships, not relationships."""
        ...

    def test_graceful_empty_when_llm_unconfigured(self):
        """When EXTRACTION_LLM_BASE_URL is unset, returns HTTP 200 with
        {"results": [], "total_entities": 0, "total_relationships": 0, "total_dropped": 0}."""
        ...

    def test_think_tag_stripping(self):
        """When LLM response contains <think>...</think> before JSON,
        the tags are stripped and JSON parses correctly."""
        ...

    def test_health_returns_200_without_llm(self):
        """GET /health returns 200 even when LLM endpoint is unavailable."""
        ...
```
4C. Dev-Proposed Edge Case Tests
Propose 2-3 additional test cases beyond the acceptance criteria. Think about:
- Long text chunks: What happens when a chunk exceeds 2048 characters? The existing NER model truncates at 2048 — does your embedding extraction handle the same truncation? Does the LLM endpoint handle very long prompts?
- Empty LLM response: What if the LLM returns an empty string or `{"choices": [{"message": {"content": ""}}]}`? Does your fallback chain handle this without crashing?
- Unicode / non-English text: What happens when the input is `"東京はMicrosoftの拠点です"` — does the BERT tokenizer handle it? Does the LLM? Do entity spans still align?
Add your proposed tests to your Design Brief when you submit your PR.
Quick Reference
| Item | Location |
| --- | --- |
| Sidecar app entry | `sidecar/app/main.py` |
| NER model class | `sidecar/app/models/ner.py` → `EntityExtractor` |
| Embedder class | `sidecar/app/models/embedder.py` → `Embedder` |
| Existing entities route | `sidecar/app/routes/entities.py` |
| Existing embed route | `sidecar/app/routes/embed.py` |
| Shared Zod schemas | `__tests__/api/graphrag/contracts/graphrag-schemas.ts` |
| Contract tests (R3) | `__tests__/api/graphrag/contract.extract-entities.test.ts`, `contract.bert-embeddings.test.ts` |
| Contract tests (R4) | `__tests__/api/graphrag/contract.extract-relationships.test.ts` |
| Python test dir | `sidecar/tests/` |
| Run Python tests | `PYTHONPATH=sidecar pytest sidecar/tests/ -v` |
| Env vars (R4) | `EXTRACTION_LLM_BASE_URL`, `EXTRACTION_LLM_MODEL` |