VLM Integration for Enhanced PDF Understanding
Background
The current PDF pipeline extracts text only. Figures, charts, tables, and diagrams carry semantic information that text extraction cannot capture. Vision Language Models (VLMs) can interpret these visual elements and make their content available for search and retrieval.
How VLM Summaries Fit Into the Pipeline
The core idea: use a VLM to generate text descriptions of visual elements during ingest. These descriptions are embedded and stored alongside regular text chunks, making visual content searchable through the existing vector + FTS pipeline.
The consuming LLM (Claude, etc.) receives richer context without any change to the search interface.
Data Storage Options
Where to store VLM-generated descriptions determines how they interact with search. Three options are on the table:
Option A: Append to chunk text
Concatenate VLM description into the text chunk that corresponds to the same page.
text: "Design patterns improve... [Figure: bar chart comparing Singleton vs Factory, 30% throughput gain]"
- Searchable via both vector and FTS
- May dilute embedding focus — a chunk about design patterns now also contains chart description, which could weaken precision for both topics
Option B: Store as metadata
Attach VLM description to the chunk's metadata field.
metadata: { fileName, fileSize, fileType, visualSummary: "Bar chart comparing..." }
- NOT searchable (metadata is not embedded or FTS-indexed in current schema)
- Returned with search results, so the consuming LLM sees it as additional context
- Zero impact on existing search behavior
- Useful when the visual content supplements nearby text rather than standing on its own
Option C: Store as separate chunks
Create independent chunks for VLM descriptions with their own embeddings.
{ text: "Figure on page 12: bar chart comparing...", chunkIndex: N, chunkType: "visual", ... }
- Searchable independently via vector and FTS
- No embedding dilution
- Granularity mismatch: text chunks are 2-5 sentences (semantic chunker output), VLM chunks describe an entire page's visuals. Different granularities in the same table may affect grouping filters and score distribution
- Needs a way to link visual chunks back to their source page/context
Comparison
| Criteria | A: In text | B: In metadata | C: Separate chunk |
| --- | --- | --- | --- |
| Visual content searchable (vector) | Yes | No | Yes |
| Visual content searchable (FTS) | Yes | No | Yes |
| Embedding quality preserved | Risk of dilution | Yes | Yes |
| Granularity consistency | Same chunk | Same chunk | Mixed granularity |
| Implementation complexity | Low | Low | Medium |
| Schema change | None | Metadata extension | Chunk type field |
Each option has trade-offs. The right choice depends on how VLM summaries actually perform in retrieval — prototyping with real PDFs is needed before committing to one.
Candidate Technologies
VLM Models (local inference, no API calls)
| Model | Params | Runtime | Notes |
| --- | --- | --- | --- |
| Qwen2-VL-2B | 2B | transformers.js (ONNX Q4), Ollama | ONNX version available, same runtime as existing embedder |
| GLM-OCR | 0.9B | Ollama, vLLM, SGLang | Top score on OmniDocBench (94.6), 1.86 pages/sec, tables and formulas to Markdown |
| ColSmol | 500M | Python (colpali-engine) | Visual retrieval embeddings, different paradigm (page-level vector search) |
Qwen2-VL-2B fits the current stack: it runs on @huggingface/transformers with ONNX, same as the embedding model. Model auto-downloads on first use (~1.5GB). No Python, no external service.
GLM-OCR scores higher on benchmarks but requires Ollama. Could serve as an optional backend for users who already have it.
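To ground the zero-setup claim, here is a minimal sketch of the Qwen2-VL-2B path via @huggingface/transformers, modeled on the library's Qwen2-VL usage; the onnx-community model id, the q4 dtype, and the retrieval-oriented prompt are assumptions, and exact method names may vary by library version.

```typescript
import {
  AutoProcessor,
  Qwen2VLForConditionalGeneration,
  RawImage,
} from "@huggingface/transformers";

// Model id is an assumption; any ONNX export of Qwen2-VL-2B-Instruct should work.
const MODEL_ID = "onnx-community/Qwen2-VL-2B-Instruct";

const processor = await AutoProcessor.from_pretrained(MODEL_ID);
const model = await Qwen2VLForConditionalGeneration.from_pretrained(MODEL_ID, {
  dtype: "q4", // quantized weights; ~1.5GB downloaded on first use
});

// Turn a rendered page image (PNG bytes) into a retrieval-oriented description.
export async function describePage(png: Uint8Array): Promise<string> {
  const image = await RawImage.fromBlob(new Blob([png], { type: "image/png" }));

  // Prompt tuned for search retrieval rather than generic captioning.
  const conversation = [
    {
      role: "user",
      content: [
        { type: "image" },
        {
          type: "text",
          text: "Describe the figures, charts and tables on this page for search. Include axis labels, key numbers and trends.",
        },
      ],
    },
  ];
  const prompt = processor.apply_chat_template(conversation, {
    add_generation_prompt: true,
  });

  const inputs = await processor(prompt, image);
  const output = await model.generate({ ...inputs, max_new_tokens: 256 });

  // Decode only the newly generated tokens (skip the prompt portion).
  const generated = output.slice(null, [inputs.input_ids.dims.at(-1), null]);
  return processor.batch_decode(generated, { skip_special_tokens: true })[0];
}
```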
Page Image Rendering
mupdf (npm, WebAssembly) provides page.toPixmap() for rendering PDF pages to images. If we migrate text extraction to mupdf, the same package handles image rendering with no additional dependency.
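A sketch of that rendering step, assuming the documented mupdf.js Document/Page/Pixmap API; the 2x scale factor is an arbitrary choice to keep chart labels legible for the VLM.

```typescript
import * as fs from "node:fs";
import * as mupdf from "mupdf";

// Render one PDF page to PNG bytes for the VLM.
export function renderPageToPNG(pdfPath: string, pageIndex: number): Uint8Array {
  const doc = mupdf.Document.openDocument(fs.readFileSync(pdfPath), "application/pdf");
  const page = doc.loadPage(pageIndex);

  // 2x scale for legibility; RGB, no alpha.
  const pixmap = page.toPixmap(
    mupdf.Matrix.scale(2, 2),
    mupdf.ColorSpace.DeviceRGB,
    false,
  );
  return pixmap.asPNG();
}
```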
Design Considerations
1. Selective processing
Not every page has visual content worth summarizing. Candidate heuristics:
- Pages where mupdf reports image blocks in structured text output
- Pages with low text-to-area ratio
- Pages with detected table structures
Skipping text-only pages keeps VLM inference cost manageable.
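A sketch of the first two heuristics, assuming the structured-text JSON from mupdf exposes per-block type and bounding-box fields; the exact JSON shape and the 0.2 coverage threshold are assumptions to verify against real PDFs.

```typescript
import * as mupdf from "mupdf";

// Heuristic: a page is worth sending to the VLM if it contains image blocks,
// or if text covers only a small fraction of the page area.
export function isVisualCandidate(page: mupdf.Page): boolean {
  // Block shape ({ type, bbox: { x, y, w, h } }) is an assumption about the
  // structured-text JSON; verify against the mupdf.js version in use.
  const blocks: any[] =
    JSON.parse(page.toStructuredText("preserve-images").asJSON()).blocks ?? [];

  if (blocks.some((b) => b.type === "image")) return true;

  const [x0, y0, x1, y1] = page.getBounds();
  const pageArea = (x1 - x0) * (y1 - y0);
  const textArea = blocks
    .filter((b) => b.type === "text")
    .reduce((sum, b) => sum + (b.bbox?.w ?? 0) * (b.bbox?.h ?? 0), 0);

  return textArea / pageArea < 0.2; // low text coverage; threshold to tune
}
```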
2. Inference cost
VLM inference is 10-100x slower than text extraction per page. Prompt design matters — a prompt tuned for "describe this figure for search retrieval" produces different (and likely more useful) output than a generic "describe this image" prompt.
3. Page tracking
Current chunks do not track which PDF page they came from. VLM summaries are inherently page-scoped. To connect visual context with text context, chunks need page number metadata.
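One possible shape for the extended chunk row; field names beyond text and metadata are illustrative, not the current schema.

```typescript
// Illustrative row shape; chunkType and pageNumber are the proposed additions.
interface ChunkRow {
  text: string;                 // chunk text or VLM description (embedded + FTS-indexed)
  vector: number[];             // embedding of `text`
  chunkIndex: number;
  chunkType: "text" | "visual"; // new: distinguishes VLM description chunks (Option C)
  pageNumber: number;           // new: source page, shared by text and visual chunks
  metadata: {
    fileName: string;
    fileSize: number;
    fileType: string;
    visualSummary?: string;     // only if Option B is chosen instead
  };
}
```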
4. Embedding compatibility
VLM summaries are natural language text, so they embed in the same vector space as regular chunks. No multimodal embedding model is required. If retrieval quality for visual queries turns out to be weak, prompt tuning for the summary is the first lever to pull.
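A sketch of that step, assuming the embedder is exposed as a transformers.js feature-extraction pipeline; the model id below is a placeholder for whatever the ingest pipeline already loads.

```typescript
import { pipeline } from "@huggingface/transformers";

// Placeholder model id; reuse the embedding model the pipeline already uses.
const embedder = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");

// A VLM description is embedded exactly like a regular text chunk.
const description =
  "Figure on page 12: bar chart comparing Singleton vs Factory, 30% throughput gain";
const output = await embedder(description, { pooling: "mean", normalize: true });
const vector = Array.from(output.data as Float32Array); // stored alongside the text
```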
5. Ingest modes
Two modes: text (default, zero-setup) and text+visual (requires ~1.5GB model download). Switching modes on re-ingest should not corrupt existing data.
6. Storage
A 100-page PDF with ~30 pages containing figures adds ~30 extra chunks. This is modest compared to ColPali-style approaches (627 vectors per page).
Architecture Sketch
ingest_file(filePath, mode = "text" | "text+visual")
text mode (default):
PDF -> mupdf -> text chunks -> embedder -> LanceDB
text+visual mode:
PDF -> mupdf -> text chunks -> embedder -> LanceDB
PDF -> mupdf (toPixmap) -> select visual pages -> VLM -> description chunks -> embedder -> LanceDB
Search logic stays the same: vector search + FTS against one table.
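Putting the pieces together, a hypothetical sketch of the text+visual branch under Option C; describePage and isVisualCandidate refer to the sketches above, while embed() and the LanceDB table stand in for the existing pipeline.

```typescript
import * as fs from "node:fs";
import * as mupdf from "mupdf";
// describePage and isVisualCandidate come from the earlier sketches;
// embed() and `table` (a LanceDB table) are placeholders for the existing pipeline.

async function ingestVisualChunks(pdfPath: string, fileName: string, table: any) {
  const doc = mupdf.Document.openDocument(fs.readFileSync(pdfPath), "application/pdf");
  const rows = [];

  for (let i = 0; i < doc.countPages(); i++) {
    const page = doc.loadPage(i);
    if (!isVisualCandidate(page)) continue; // selective processing

    const png = page
      .toPixmap(mupdf.Matrix.scale(2, 2), mupdf.ColorSpace.DeviceRGB, false)
      .asPNG();
    const description = `Figure(s) on page ${i + 1}: ${await describePage(png)}`;

    rows.push({
      text: description,                // searchable via vector + FTS (Option C)
      vector: await embed(description), // same embedder as text chunks
      chunkType: "visual",
      pageNumber: i + 1,
      metadata: { fileName },
    });
  }

  if (rows.length > 0) await table.add(rows); // same LanceDB table as text chunks
}
```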
Open Questions
- Which storage option (A/B/C) works best in practice? Needs prototype comparison with real PDFs.
- How does chunk granularity mismatch (Option C) affect the grouping filter?
- What VLM prompt produces the most search-effective descriptions?
- Should users be able to plug in an Ollama endpoint for a stronger VLM?
- How to handle re-ingest when switching between text and text+visual modes?