Commit 807adef

Merge pull request #80 from Deodat-Lawson/main
Nov 11 code update
2 parents 8784c6e + 04dc754 commit 807adef

32 files changed

Lines changed: 1639 additions & 356 deletions

.env.example

Lines changed: 5 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -1,4 +1,4 @@
1-
# Since the ".env" file is gitignored, you can use the ".env.example" file to
1+
# Since the ".env" file is gitignored, you can use the ".env.example" file to
22
# build a new ".env" file when you clone the repo. Keep this file up-to-date
33
# when you add new variables to `.env`.
44

@@ -23,5 +23,9 @@ OPENAI_API_KEY="your_openai_api_key"
2323
UPLOADTHING_SECRET="your_uploadthing_secret"
2424
UPLOADTHING_APP_ID="your_uploadthing_app_id"
2525

26+
# Datalab OCR API (optional - get from https://www.datalab.to/)
27+
# Required only if you want to enable OCR processing for scanned documents
28+
DATALAB_API_KEY="your_datalab_api_key"
29+
2630
# Environment
2731
NODE_ENV="development"

.gitignore

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -43,4 +43,6 @@ yarn-error.log*
4343
*.tsbuildinfo
4444

4545
# idea files
46-
.idea
46+
.idea
47+
/.localFiles
48+
.windsurf/rules/markdowncreation.md

CHANGELOG.md

Lines changed: 42 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -8,13 +8,29 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
88
## [Unreleased]
99

1010
### Added
11+
- **OCR Processing Feature** - Advanced optical character recognition for scanned documents
12+
- New OCR service module (`src/app/api/services/ocrService.ts`) with Datalab Marker API integration
13+
- Asynchronous submission and polling architecture for OCR processing
14+
- Configurable OCR options (force_ocr, use_llm, output_format, strip_existing_ocr)
15+
- Comprehensive error handling and retry logic with 5-minute timeout
16+
- Database schema enhancements: `ocrEnabled`, `ocrProcessed`, `ocrMetadata` fields
17+
- Frontend OCR checkbox in document upload interface with help text
18+
- Custom styling for OCR checkbox with dark theme support
19+
- Optional `DATALAB_API_KEY` environment variable for OCR functionality
1120
- Enhanced environment variable validation in `src/env.js` with comprehensive schema for all required variables
1221
- New constants file (`src/lib/constants.ts`) for centralized configuration management
1322
- API utilities (`src/lib/api-utils.ts`) for standardized error handling and response patterns
1423
- Comprehensive TypeScript types (`src/types/api.ts`) for better type safety across the application
1524
- Missing environment variables in `.env.example` file with proper documentation
1625

1726
### Enhanced
27+
- **Document Upload API** (`src/app/api/uploadDocument/route.ts`):
28+
- Dual-path processing architecture: OCR path for scanned documents, standard path for digital PDFs
29+
- Unified chunking and embedding pipeline for both processing methods
30+
- Stores OCR metadata with document records for tracking and analytics
31+
- Support for `enableOCR` parameter in upload requests
32+
- Improved type safety with proper TypeScript interfaces
33+
1834
- **Predictive Document Analysis API** (`src/app/api/predictive-document-analysis/route.ts`):
1935
- Improved input validation with detailed error messages
2036
- Enhanced error handling with specific error types and HTTP status codes
@@ -35,6 +51,15 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
3551
- Development experience with better IntelliSense and type checking
3652
- Documentation and code maintainability
3753

54+
### Fixed
55+
- **TypeScript/ESLint Compliance**:
56+
- Replaced all `any` types with proper TypeScript types in `ocrService.ts`
57+
- Fixed unsafe type assignments and member access violations
58+
- Removed trivially inferred type annotations
59+
- Replaced logical OR (`||`) with nullish coalescing (`??`) for safer null/undefined handling
60+
- Improved type safety in `uploadDocument/route.ts`
61+
- All linter errors resolved (38 errors fixed)
62+
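The switch from `||` to `??` matters whenever a legitimate value is falsy. A minimal illustration follows; the helper names `pageCountOr` and `pageCountNullish` are hypothetical and are not taken from `ocrService.ts`:

```typescript
// Hypothetical helpers for illustration only; not code from ocrService.ts.
function pageCountOr(pages: number | null | undefined): number {
  // `||` falls back on ANY falsy value, so a real 0 is replaced by -1.
  return pages || -1;
}

function pageCountNullish(pages: number | null | undefined): number {
  // `??` falls back only on null or undefined, so 0 is preserved.
  return pages ?? -1;
}

console.log(pageCountOr(0), pageCountNullish(0));
```

With `??`, a document that reports zero pages keeps its `0` instead of being silently replaced by the fallback.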
3863
### Technical Improvements
3964
- Centralized configuration management
4065
- Standardized API response patterns
@@ -47,6 +72,23 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
4772
- Updated `.env.example` with all required variables and documentation
4873
- Better configuration management with centralized constants
4974
- Improved development setup documentation
75+
- Added `DATALAB_API_KEY` for optional OCR functionality
76+
77+
### Documentation
78+
- **README.md** - Comprehensive OCR feature documentation:
79+
- Added OCR processing section with detailed usage guide
80+
- When to use OCR (scanned documents, image-based PDFs, handwritten content)
81+
- Backend infrastructure details (service module, database schema, API integration)
82+
- Frontend integration documentation (UI, validation, styling)
83+
- Processing flow diagrams for both standard and OCR paths
84+
- OCR vs Standard processing comparison table
85+
- Error handling documentation
86+
- Datalab API setup instructions
87+
- Environment variables reference updated with `DATALAB_API_KEY`
88+
- API endpoints section updated with OCR support details
89+
- Project structure updated to include OCR service
90+
- Added OCR troubleshooting section
91+
- **CHANGELOG.md** - Documented all OCR feature additions and linter fixes
5092

5193
## [Previous Versions]
5294
This changelog starts from the current state of the codebase. Previous version history can be found in the git commit history.

LICENSE

Lines changed: 21 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,21 @@
1+
MIT License
2+
3+
Copyright (c) [2025] [Timothy Lin]
4+
5+
Permission is hereby granted, free of charge, to any person obtaining a copy
6+
of this software and associated documentation files (the "Software"), to deal
7+
in the Software without restriction, including without limitation the rights
8+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9+
copies of the Software, and to permit persons to whom the Software is
10+
furnished to do so, subject to the following conditions:
11+
12+
The above copyright notice and this permission notice shall be included in all
13+
copies or substantial portions of the Software.
14+
15+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21+
SOFTWARE.

README.md

Lines changed: 148 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -15,6 +15,7 @@ A Next.js application that uses advanced AI technology to analyze, interpret, an
1515

1616
### 📄 **Professional Document Analysis**
1717
- Advanced AI algorithms analyze documents and extract key information
18+
- **OCR Processing**: Optional advanced OCR using Datalab Marker API for scanned documents and images
1819
- **AI-Powered Chat**: Interactive chat interface for document-specific questions and insights
1920
- **Role-Based Authentication**: Separate interfaces for employees and employers using Clerk
2021
- **Document Management**: Upload, organize, and manage documents with category support
@@ -51,6 +52,117 @@ The system provides comprehensive analysis including:
5152

5253
## 📖 Usage Examples
5354

55+
### OCR Processing for Scanned Documents
56+
57+
PDR AI includes optional advanced OCR (Optical Character Recognition) capabilities for processing scanned documents, images, and PDFs with poor text extraction.
58+
59+
#### When to Use OCR
60+
- **Scanned Documents**: Physical documents that have been scanned to PDF
61+
- **Image-based PDFs**: PDFs that contain images of text rather than actual text
62+
- **Poor Quality Documents**: Documents with low-quality text that standard extraction can't read
63+
- **Handwritten Content**: Documents with handwritten notes or forms (with AI assistance)
64+
- **Mixed Content**: Documents combining text, images, tables, and diagrams
65+
66+
#### How It Works
67+
68+
**Backend Infrastructure:**
69+
1. **Environment Configuration**: Set `DATALAB_API_KEY` in your `.env` file (optional)
70+
2. **Database Schema**: Tracks OCR status with fields:
71+
- `ocrEnabled`: Boolean flag indicating if OCR was requested
72+
- `ocrProcessed`: Boolean flag indicating if OCR completed successfully
73+
- `ocrMetadata`: JSON field storing OCR processing details (page count, processing time, etc.)
74+
75+
3. **OCR Service Module** (`src/app/api/services/ocrService.ts`):
76+
- Complete Datalab Marker API integration
77+
- Asynchronous submission and polling architecture
78+
- Configurable processing options (force_ocr, use_llm, output_format)
79+
- Comprehensive error handling and retry logic
80+
- Timeout management (5 minutes default)
81+
82+
4. **Upload API Enhancement** (`src/app/api/uploadDocument/route.ts`):
83+
- **Dual-path processing**:
84+
- OCR Path: Uses Datalab Marker API when `enableOCR=true`
85+
- Standard Path: Uses traditional PDFLoader for regular PDFs
86+
- Unified chunking and embedding pipeline
87+
- Stores OCR metadata with document records
88+
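The submit-then-poll pattern with a 5-minute limit described above can be sketched roughly as follows. This is only an illustration: the status values, the 2-second interval, and the names `JobStatus`, `isTerminal`, `hasTimedOut`, and `pollUntilDone` are assumptions, since the body of `ocrService.ts` and the Datalab response shape are not shown in this commit view.

```typescript
// Hypothetical status values and helper names; the real Datalab response
// shape and the internals of ocrService.ts are not shown in this commit.
type JobStatus = "processing" | "complete" | "failed";

const DEFAULT_TIMEOUT_MS = 5 * 60 * 1000; // the 5-minute limit from the README

// True once the job no longer needs polling.
function isTerminal(status: JobStatus): boolean {
  return status === "complete" || status === "failed";
}

// True when the elapsed time has reached the timeout.
function hasTimedOut(
  startedAtMs: number,
  nowMs: number,
  timeoutMs: number = DEFAULT_TIMEOUT_MS
): boolean {
  return nowMs - startedAtMs >= timeoutMs;
}

// Calls `checkStatus` repeatedly until a terminal status or the timeout.
async function pollUntilDone(
  checkStatus: () => Promise<JobStatus>,
  intervalMs: number = 2000, // assumed interval, not from the source
  timeoutMs: number = DEFAULT_TIMEOUT_MS
): Promise<JobStatus> {
  const startedAt = Date.now();
  while (!hasTimedOut(startedAt, Date.now(), timeoutMs)) {
    const status = await checkStatus();
    if (isTerminal(status)) return status;
    // Wait before the next status check.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`OCR job did not finish within ${timeoutMs} ms`);
}
```

Passing the status check in as a function keeps the loop independent of any particular HTTP client, which also makes it easy to exercise with a stub.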
89+
**Frontend Integration:**
90+
1. **Upload Form UI**: OCR checkbox appears when `DATALAB_API_KEY` is configured
91+
2. **Form Validation**: Schema validates `enableOCR` field
92+
3. **User Guidance**: Help text explains when to use OCR
93+
4. **Dark Theme Support**: Custom checkbox styling for both light and dark modes
94+
95+
#### Processing Flow
96+
97+
```text
98+
// Standard PDF Upload (enableOCR: false or not set)
99+
1. Download PDF from URL
100+
2. Extract text using PDFLoader
101+
3. Split into chunks
102+
4. Generate embeddings
103+
5. Store in database
104+
105+
// OCR-Enhanced Upload (enableOCR: true)
106+
1. Download PDF from URL
107+
2. Submit to Datalab Marker API
108+
3. Poll for completion (up to 5 minutes)
109+
4. Receive markdown/HTML/JSON output
110+
5. Split into chunks
111+
6. Generate embeddings
112+
7. Store in database with OCR metadata
113+
```
114+
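The branch between the two flows above might be expressed along these lines. The names `UploadRequest`, `ProcessingPath`, and `selectProcessingPath` are hypothetical, and falling back to the standard path when no API key is configured is an assumption of this sketch; the actual handler in `uploadDocument/route.ts` may reject such a request instead.

```typescript
// Hypothetical names; the actual route handler is not shown in this commit view.
type ProcessingPath = "ocr" | "standard";

interface UploadRequest {
  documentUrl: string;
  enableOCR?: boolean;
}

// Picks the OCR path only when it is requested AND an API key is configured.
// Assumption: a missing key silently falls back to the standard path.
function selectProcessingPath(
  req: UploadRequest,
  datalabApiKey: string | undefined
): ProcessingPath {
  const wantsOcr = req.enableOCR ?? false;
  const keyConfigured = (datalabApiKey ?? "").trim().length > 0;
  return wantsOcr && keyConfigured ? "ocr" : "standard";
}
```

Both paths then feed the same chunking and embedding steps, so the choice only affects how text is extracted.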
115+
#### OCR Configuration Options
116+
117+
```typescript
118+
interface OCROptions {
119+
force_ocr?: boolean; // Force OCR even if text exists
120+
use_llm?: boolean; // Use AI for better accuracy
121+
output_format?: 'markdown' | 'json' | 'html'; // Output format
122+
strip_existing_ocr?: boolean; // Remove existing OCR layer
123+
}
124+
```
125+
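One way to merge caller-supplied options with defaults is sketched below. The default values and the names `DEFAULT_OCR_OPTIONS` and `resolveOcrOptions` are assumptions made for illustration; the defaults actually used by `ocrService.ts` are not visible in this commit view.

```typescript
interface OCROptions {
  force_ocr?: boolean;
  use_llm?: boolean;
  output_format?: "markdown" | "json" | "html";
  strip_existing_ocr?: boolean;
}

// Hypothetical defaults chosen for this sketch only.
const DEFAULT_OCR_OPTIONS: Required<OCROptions> = {
  force_ocr: false,
  use_llm: false,
  output_format: "markdown",
  strip_existing_ocr: false,
};

// Fills in any option the caller left out. `??` is used so that an
// explicit `false` from the caller is kept rather than overridden.
function resolveOcrOptions(overrides: OCROptions = {}): Required<OCROptions> {
  return {
    force_ocr: overrides.force_ocr ?? DEFAULT_OCR_OPTIONS.force_ocr,
    use_llm: overrides.use_llm ?? DEFAULT_OCR_OPTIONS.use_llm,
    output_format: overrides.output_format ?? DEFAULT_OCR_OPTIONS.output_format,
    strip_existing_ocr:
      overrides.strip_existing_ocr ?? DEFAULT_OCR_OPTIONS.strip_existing_ocr,
  };
}
```

For example, `resolveOcrOptions({ use_llm: true })` enables LLM assistance while leaving the other three options at the sketch's defaults.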
126+
#### Using the OCR Feature
127+
128+
1. **Configure API Key** (one-time setup):
129+
```env
130+
DATALAB_API_KEY=your_datalab_api_key
131+
```
132+
133+
2. **Upload Document with OCR**:
134+
- Navigate to the employer upload page
135+
- Select your document
136+
- Check the "Enable OCR Processing" checkbox
137+
- Upload the document
138+
- The system processes the document with OCR and notifies you when processing completes
139+
140+
3. **Monitor Processing**:
141+
- OCR processing typically takes 1-3 minutes
142+
- Progress is tracked in backend logs
143+
- Document becomes available once processing completes
144+
145+
#### OCR vs Standard Processing
146+
147+
| Feature | Standard Processing | OCR Processing |
148+
|---------|-------------------|----------------|
149+
| **Best For** | Digital PDFs with embedded text | Scanned documents, images |
150+
| **Processing Time** | < 10 seconds | 1-3 minutes |
151+
| **Accuracy** | High for digital text | High for scanned/image text |
152+
| **Cost** | No OCR cost (OpenAI embedding usage only) | Requires Datalab API credits |
153+
| **Handwriting Support** | No | Yes (with AI assistance) |
154+
| **Table Extraction** | Basic | Advanced |
155+
| **Image Analysis** | No | Yes |
156+
157+
#### Error Handling
158+
159+
The OCR system includes comprehensive error handling:
160+
- API connection failures
161+
- Timeout management (5-minute limit)
162+
- Retry logic for transient errors
163+
- Graceful fallback messages
164+
- Detailed error logging
165+
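Retry logic for transient errors is often built from two small pieces: a classifier for retryable failures and a backoff schedule. The sketch below shows one common approach; the status codes treated as transient, the delay values, and the names `isTransientStatus` and `backoffDelayMs` are assumptions, not details taken from `ocrService.ts`.

```typescript
// Hypothetical classification and backoff; not taken from ocrService.ts.

// Treats rate limiting (429) and server errors (5xx) as retryable.
function isTransientStatus(httpStatus: number): boolean {
  return httpStatus === 429 || (httpStatus >= 500 && httpStatus <= 599);
}

// Exponential backoff: baseMs * 2^attempt, capped at maxMs.
// `attempt` starts at 0 for the first retry.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 1000,
  maxMs: number = 30000
): number {
  return Math.min(baseMs * Math.pow(2, attempt), maxMs);
}
```

Client errors such as an invalid API key (401) are not retried under this scheme, since repeating the request would not change the outcome.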
54166
### Predictive Document Analysis
55167

56168
The predictive analysis feature automatically scans uploaded documents and provides comprehensive insights:
@@ -182,6 +294,7 @@ const response = await fetch('/api/LangChain', {
182294
- **Authentication**: [Clerk](https://clerk.com/)
183295
- **Database**: PostgreSQL with [Drizzle ORM](https://orm.drizzle.team/)
184296
- **AI Integration**: [OpenAI](https://openai.com/) + [LangChain](https://langchain.com/)
297+
- **OCR Processing**: [Datalab Marker API](https://www.datalab.to/) (optional)
185298
- **File Upload**: [UploadThing](https://uploadthing.com/)
186299
- **Styling**: [Tailwind CSS](https://tailwindcss.com/)
187300
- **Package Manager**: [pnpm](https://pnpm.io/)
@@ -248,6 +361,11 @@ LANGCHAIN_API_KEY=your_langchain_api_key
248361
# Used for finding related documents and external resources
249362
TAVILY_API_KEY=your_tavily_api_key
250363
364+
# Datalab Marker API (get from https://www.datalab.to/)
365+
# Optional: Required for advanced OCR processing of scanned documents
366+
# Enables OCR checkbox in document upload interface
367+
DATALAB_API_KEY=your_datalab_api_key
368+
251369
# UploadThing (get from https://uploadthing.com/)
252370
# Required for file uploads (PDF documents)
253371
UPLOADTHING_SECRET=your_uploadthing_secret
@@ -317,6 +435,13 @@ pnpm db:push
317435
3. Add `TAVILY_API_KEY` to your `.env` file
318436
4. Used for enhanced web search capabilities in document analysis features
319437

438+
#### Datalab Marker API - Optional
439+
1. Create account at [Datalab](https://www.datalab.to/)
440+
2. Navigate to the API section and generate an API key
441+
3. Add `DATALAB_API_KEY` to your `.env` file
442+
4. Enables advanced OCR processing for scanned documents and images in PDFs
443+
5. When configured, an OCR checkbox will appear in the document upload interface
444+
320445
#### UploadThing
321446
1. Create account at [UploadThing](https://uploadthing.com/)
322447
2. Create a new app
@@ -421,6 +546,7 @@ Vercel is the recommended platform for Next.js applications:
421546
- `LANGCHAIN_TRACING_V2=true` (optional, for LangSmith tracing)
422547
- `LANGCHAIN_API_KEY` (optional, required if `LANGCHAIN_TRACING_V2=true`)
423548
- `TAVILY_API_KEY` (optional, for enhanced web search)
549+
- `DATALAB_API_KEY` (optional, for OCR processing)
424550
- `NEXT_PUBLIC_CLERK_SIGN_IN_FORCE_REDIRECT_URL` (optional)
425551
- `NEXT_PUBLIC_CLERK_SIGN_UP_FORCE_REDIRECT_URL` (optional)
426552
- `NEXT_PUBLIC_CLERK_SIGN_OUT_FORCE_REDIRECT_URL` (optional)
@@ -621,11 +747,15 @@ src/
621747
│ │ ├── predictive-document-analysis/ # Predictive analysis endpoints
622748
│ │ │ ├── route.ts # Main analysis API
623749
│ │ │ └── agent.ts # AI analysis agent
750+
│ │ ├── services/ # Backend services
751+
│ │ │ └── ocrService.ts # OCR processing service
752+
│ │ ├── uploadDocument/ # Document upload endpoint
624753
│ │ ├── LangChain/ # AI chat functionality
625754
│ │ └── ... # Other API endpoints
626755
│ ├── employee/ # Employee dashboard pages
627756
│ ├── employer/ # Employer dashboard pages
628-
│ │ └── documents/ # Document viewer with predictive analysis
757+
│ │ ├── documents/ # Document viewer with predictive analysis
758+
│ │ └── upload/ # Document upload with OCR option
629759
│ ├── signup/ # Authentication pages
630760
│ └── _components/ # Shared components
631761
├── server/
@@ -637,6 +767,8 @@ Key directories:
637767
- `/employee` - Employee interface for document viewing and chat
638768
- `/employer` - Employer interface for management and uploads
639769
- `/api/predictive-document-analysis` - Core predictive analysis functionality
770+
- `/api/services` - Reusable backend services (OCR, etc.)
771+
- `/api/uploadDocument` - Document upload with OCR support
640772
- `/api` - Backend API endpoints for all functionality
641773
- `/server/db` - Database schema and configuration
642774
```
@@ -646,7 +778,12 @@ Key directories:
646778
### Predictive Document Analysis
647779
- `POST /api/predictive-document-analysis` - Analyze documents for missing content and recommendations
648780
- `GET /api/fetchDocument` - Retrieve document content for analysis
649-
- `POST /api/uploadDocument` - Upload documents for processing
781+
782+
### Document Upload & Processing
783+
- `POST /api/uploadDocument` - Upload documents for processing (supports OCR via `enableOCR` parameter)
784+
- Standard path: Uses PDFLoader for digital PDFs
785+
- OCR path: Uses Datalab Marker API for scanned documents
786+
- Returns document metadata including OCR processing status
650787
651788
### AI Chat & Q&A
652789
- `POST /api/LangChain` - AI-powered document Q&A
@@ -687,6 +824,7 @@ Key directories:
687824
| `LANGCHAIN_TRACING_V2` | Enable LangSmith tracing for LangChain operations. Set to `true` to enable. Get API key from [LangSmith](https://smith.langchain.com/) | ❌ | `true` or `false` |
688825
| `LANGCHAIN_API_KEY` | LangChain API key for LangSmith tracing and monitoring. Required if `LANGCHAIN_TRACING_V2=true`. Get from [LangSmith](https://smith.langchain.com/) | ❌ | `lsv2_...` |
689826
| `TAVILY_API_KEY` | Tavily Search API key for enhanced web search in document analysis. Get from [Tavily](https://tavily.com/) | ❌ | `tvly-...` |
827+
| `DATALAB_API_KEY` | Datalab Marker API key for advanced OCR processing of scanned documents. Get from [Datalab](https://www.datalab.to/) | ❌ | `your_datalab_key` |
690828
| `UPLOADTHING_SECRET` | UploadThing secret key for file uploads. Get from [UploadThing Dashboard](https://uploadthing.com/) | ✅ | `sk_live_...` |
691829
| `UPLOADTHING_APP_ID` | UploadThing application ID. Get from [UploadThing Dashboard](https://uploadthing.com/) | ✅ | `your_app_id` |
692830
| `NODE_ENV` | Environment mode. Must be one of: `development`, `test`, `production` | ✅ | `development` |
@@ -700,6 +838,7 @@ Key directories:
700838
- **AI Features**: `OPENAI_API_KEY` (used for embeddings, chat, and document analysis)
701839
- **AI Observability**: `LANGCHAIN_TRACING_V2`, `LANGCHAIN_API_KEY` (for LangSmith tracing and monitoring)
702840
- **Search Features**: `TAVILY_API_KEY` (for enhanced web search in document analysis)
841+
- **OCR Processing**: `DATALAB_API_KEY` (for advanced OCR of scanned documents)
703842
- **File Uploads**: `UPLOADTHING_SECRET`, `UPLOADTHING_APP_ID`
704843
- **Build Configuration**: `NODE_ENV`, `SKIP_ENV_VALIDATION`
705844
@@ -720,6 +859,13 @@ Key directories:
720859
- Reinstall dependencies: `rm -rf node_modules && pnpm install`
721860
- Check TypeScript errors: `pnpm typecheck`
722861
862+
### OCR Processing Issues
863+
- **OCR checkbox not appearing**: Verify `DATALAB_API_KEY` is set in your `.env` file
864+
- **OCR processing timeout**: Documents that take longer than 5 minutes will time out; try smaller documents first
865+
- **OCR processing failed**: Check API key validity and Datalab service status
866+
- **Poor OCR quality**: Enable `use_llm: true` option in OCR configuration for AI-enhanced accuracy
867+
- **Cost concerns**: OCR uses Datalab API credits; use only for scanned/image-based documents
868+
723869
## 🤝 Contributing
724870
725871
1. Fork the repository
