Update Jan 31 #150
…ent function. Added new interfaces for PDF text items and content, updated pagerender type, and implemented chunk insertion and OCR metadata updates in the database.
- Lazy load determineDocumentRouting in processDocument.ts
- This prevents HuggingFace Transformers and onnxruntime-node from being traced
- Add serverExternalPackages for heavy dependencies
- Add outputFileTracingExcludes for Inngest route
- Add vercel.json with function configuration

Reduces bundle from 453MB to under 250MB limit
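The lazy-loading idea in this commit can be sketched roughly as below. This is a minimal illustration rather than the code from the PR: `lazyModule`, `RoutingModule`, and the stand-in loader are hypothetical names, and in the real function the loader would presumably be a dynamic `import()` of whichever module exports `determineDocumentRouting`.

```typescript
// Hypothetical helper: defer loading a heavy module until first use,
// and reuse the same promise on every later call.
function lazyModule<T>(loader: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (cached === undefined) {
      cached = loader();
    }
    return cached;
  };
}

// Assumed shape of the routing module, for illustration only.
interface RoutingModule {
  determineDocumentRouting(documentUrl: string): string;
}

let loadCount = 0;

// Stand-in loader. In real code this would be something like
// `() => import("<path to the routing module>")`, so that bundler tracing
// does not follow the heavy dependency from the top of the file.
const getRouting = lazyModule<RoutingModule>(async () => {
  loadCount += 1;
  return {
    determineDocumentRouting: (url: string) =>
      url.endsWith(".pdf") ? "pdf" : "other",
  };
});

async function routeDocument(documentUrl: string): Promise<string> {
  const routing = await getRouting(); // loaded on first call only
  return routing.determineDocumentRouting(documentUrl);
}
```

Nothing is loaded at module evaluation time; the loader runs once, on the first call to `routeDocument`.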
- Set TRANSFORMERS_BACKEND=wasm environment variable
- Configure USE_ONNX_NODE=false to prevent Node.js runtime
- This ensures onnxruntime-node (404MB) is never required
- Lazy loading + WASM backend = minimal bundle size
Feature/improve ocr workflow
…Deodat-Lawson/pdr_ai_v2 into feature/knowledge-base-redesign
…esign Knowledge Base Redesign - RLM-Ready Hierarchical Document Storage
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7678ca57ad
const { jobId, eventIds } = await triggerDocumentProcessing(
  documentUrl,
  documentName,
  companyId,
  userId,
Insert OCR job before dispatching Inngest event
The event is dispatched here before the ocr_jobs row exists. If the Inngest worker starts quickly, the completion update in processDocument (step‑e) will run against a missing row and no status change will be persisted, leaving the job stuck as queued when clients poll /api/uploadDocument?jobId. Insert the job record before sending the event or make the completion update an upsert/create-on-miss.
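Both fixes suggested in this comment can be sketched together. The following is a minimal sketch under stated assumptions, not the repository's code: the `Map` stands in for the `ocr_jobs` table, and `insertJob`, `markJobStatus`, `enqueueDocument`, and the `Dispatch` type are hypothetical names. A real dispatcher would send the Inngest event.

```typescript
type JobStatus = "queued" | "completed" | "failed";

interface OcrJob {
  jobId: string;
  documentUrl: string;
  status: JobStatus;
}

// In-memory stand-in for the ocr_jobs table (assumption for illustration).
const ocrJobs = new Map<string, OcrJob>();

async function insertJob(job: OcrJob): Promise<void> {
  ocrJobs.set(job.jobId, job);
}

// Completion written as an upsert: the status is persisted even if the
// row is missing when the worker finishes.
async function markJobStatus(
  jobId: string,
  documentUrl: string,
  status: JobStatus,
): Promise<void> {
  const existing = ocrJobs.get(jobId);
  ocrJobs.set(jobId, {
    jobId,
    documentUrl: existing?.documentUrl ?? documentUrl,
    status,
  });
}

// Hypothetical dispatcher signature.
type Dispatch = (jobId: string, documentUrl: string) => Promise<void>;

async function enqueueDocument(
  jobId: string,
  documentUrl: string,
  dispatch: Dispatch,
): Promise<void> {
  // 1. Persist the job row first, so a fast worker always finds it.
  await insertJob({ jobId, documentUrl, status: "queued" });
  // 2. Only then dispatch the processing event.
  await dispatch(jobId, documentUrl);
}
```

Ordering the insert before the dispatch removes the race; the upsert is a second line of defense if the row is ever absent.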
await step.run("step-e-storage", async () => {
  if (vectorizedChunks.length === 0) {
    console.log("[Step E] No chunks to store");
    return;
Record failure/complete status when no chunks are produced
When vectorizedChunks is empty, the function returns early and skips updating document and ocr_jobs. Any document that yields no chunks (e.g., OCR returns empty text, parsing errors, or a truly blank PDF) will never transition out of queued, which makes status polling hang indefinitely. Consider marking the job failed or completed-with-zero-chunks and updating the document metadata even in the empty case.
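One way to handle the empty case is to always write a terminal status before returning. The sketch below assumes a hypothetical `StorageDeps` interface in place of the real database calls; `storeChunks`, `StorageResult`, and the choice of `"failed"` for zero chunks are illustrative, not taken from the PR.

```typescript
type FinalStatus = "completed" | "failed";

interface VectorizedChunk {
  content: string;
  embedding: number[];
}

interface StorageResult {
  status: FinalStatus;
  chunkCount: number;
  reason?: string;
}

// Hypothetical persistence callbacks standing in for the real
// document / ocr_jobs updates.
interface StorageDeps {
  insertChunks(chunks: VectorizedChunk[]): Promise<void>;
  updateJobStatus(result: StorageResult): Promise<void>;
}

async function storeChunks(
  chunks: VectorizedChunk[],
  deps: StorageDeps,
): Promise<StorageResult> {
  if (chunks.length === 0) {
    // Still record a terminal status so polling clients stop waiting.
    const empty: StorageResult = {
      status: "failed",
      chunkCount: 0,
      reason: "no chunks produced",
    };
    await deps.updateJobStatus(empty);
    return empty;
  }
  await deps.insertChunks(chunks);
  const done: StorageResult = { status: "completed", chunkCount: chunks.length };
  await deps.updateJobStatus(done);
  return done;
}
```

Whether an empty document counts as failed or as completed with zero chunks is a product decision; the point is that the job leaves `queued` either way.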
pages.push({
  pageNumber: 1,
  textBlocks: [data.text],
  tables: []
});
Avoid tagging all native-PDF content as page 1
Native PDF parsing collapses the entire document into a single page and hardcodes pageNumber: 1. For multi-page PDFs this makes every chunk appear to come from page 1, so recommendedPages/citations become incorrect even though the document has multiple pages. If per-page splitting isn’t available, consider setting pageNumber to null or approximating page ranges rather than marking everything as page 1.
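A minimal sketch of the `null` fallback suggested here, assuming a hypothetical `buildPages` helper and an optional `perPageText` array that a parser may or may not be able to supply. The `PageContent` shape mirrors the fields in the snippet above but is otherwise an assumption.

```typescript
interface PageContent {
  pageNumber: number | null; // null means "page unknown"
  textBlocks: string[];
  tables: unknown[];
}

// perPageText is a hypothetical input: per-page text when the parser
// can provide it. Without it, the whole document becomes one entry
// whose page number is explicitly unknown rather than hardcoded to 1.
function buildPages(fullText: string, perPageText?: string[]): PageContent[] {
  if (perPageText !== undefined && perPageText.length > 0) {
    return perPageText.map((text, index) => ({
      pageNumber: index + 1,
      textBlocks: [text],
      tables: [],
    }));
  }
  return [{ pageNumber: null, textBlocks: [fullText], tables: [] }];
}
```

Downstream code that builds citations would then need to treat a `null` page number as "no page reference" instead of citing page 1.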
No description provided.