An intelligent documentation translator that automatically synchronizes incremental documentation updates from a source language repository to a target language repository using AI-powered alignment and translation. It supports both pull request diff sync and commit diff sync, making it suitable for contributor-driven PR updates as well as scheduled translation workflows.
- Translates only what changed: Instead of re-translating entire documents, the tool identifies and translates only the modified sections, drastically reducing API costs
- Token-efficient processing: Smart section matching minimizes redundant AI calls
- Configurable limits: Set token budgets to control spending
- Context-aware translation: Provides AI with existing translations as reference, ensuring consistent terminology and style across updates
- Glossary-aware translation: Loads a project glossary (e.g. terms.md) and automatically matches only the terms that appear in the current document, feeding them to the AI for accurate, consistent terminology without bloating the prompt with the entire glossary
- Preserves established translations: Reuses proven translations for unchanged content
- Maintains voice and tone: Keeps your documentation's character consistent over time
- Non-destructive updates: Only modifies sections that actually changed in the source diff
- Preserves untouched content: Sections not mentioned in the source diff remain completely unchanged in the target
- Section-level granularity: Surgical precision in applying updates, avoiding accidental overwrites
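
The glossary-aware behavior listed above can be illustrated with a minimal sketch. It assumes the glossary is a two-column Markdown table; the real format of terms.md and the functions in glossary.py are not shown in this README, so `parse_glossary`, `filter_terms`, and the sample entries are hypothetical.

```python
import re
from typing import Dict


def parse_glossary(markdown_table: str) -> Dict[str, str]:
    """Parse a two-column Markdown table into {source_term: target_term}.

    Assumption: rows after the |---|---| separator are data rows.
    """
    terms: Dict[str, str] = {}
    seen_separator = False
    for line in markdown_table.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) < 2:
            continue
        if cells[0] and set(cells[0]) <= set("-:"):
            seen_separator = True
            continue
        if seen_separator and cells[0]:
            terms[cells[0]] = cells[1]
    return terms


def filter_terms(glossary: Dict[str, str], document: str) -> Dict[str, str]:
    """Keep only the glossary entries whose source term occurs in the document."""
    matched: Dict[str, str] = {}
    for source_term, target_term in glossary.items():
        # Whole-term, case-insensitive match so "cluster" does not hit "clustered".
        pattern = r"(?<!\w)" + re.escape(source_term) + r"(?!\w)"
        if re.search(pattern, document, flags=re.IGNORECASE):
            matched[source_term] = target_term
    return matched


# Illustrative entries only; not taken from a real glossary.
GLOSSARY = parse_glossary("""
| English | Chinese |
|---|---|
| replication | 同步 |
| cluster | 集群 |
| changefeed | 同步任务 |
""")

relevant = filter_terms(GLOSSARY, "Create a changefeed for each cluster.")
print(relevant)
```

Only the two terms that occur in the sample sentence would be passed to the AI, which is the point of filtering: the prompt grows with the document's vocabulary rather than with the size of the glossary.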
| Approach | Cost per Update | Consistency | Risk of Breaking Unchanged Content |
|---|---|---|---|
| Full Document Re-translation | 💸💸💸 High (entire doc) | ||
| Manual Section Translation | 💰 Medium (time-intensive) | ✅ Good if careful | |
| This Tool (Smart Incremental) | 💚 Low (only changes) | ✅ Excellent | ✅ Minimal - surgical updates |
- 🔄 Automated PR Synchronization: Analyzes source PR changes and applies translated updates to target repository
- 🕒 Commit-Based Incremental Sync: Analyzes source commit ranges and creates target-language updates from base_ref -> head_ref diffs
- 🤖 AI-Powered Translation: Supports multiple AI providers (DeepSeek, Gemini) for high-quality technical translation
- 📄 Smart File Operations: Handles added, deleted, and modified files intelligently
- 🎯 Section-Level Matching: Advanced algorithms match source and target document sections with high accuracy
- 🔧 GitHub Actions Ready: Designed to run seamlessly in CI/CD workflows
- Direct Matching: Exact matching for identical section hierarchies
- AI Fuzzy Matching: Handles restructured or renamed sections using AI
- Glossary Filtering: Matches only the relevant glossary terms for each document, keeping prompts lean and costs low
- System Variable Recognition: Automatically identifies configuration items and system variables
- Special File Handling: Custom logic for TOC files and configuration documents
- Batch Processing: Efficient handling of large documentation files
- Python 3.7+
- GitHub Personal Access Token with repo access
- API keys for your chosen AI provider (DeepSeek or Gemini)
- Clone the repository:

  git clone https://github.com/yourusername/ai-pr-translator.git
  cd ai-pr-translator

- Install dependencies:

  cd scripts
  pip install -r requirements.txt

Set the variables for the mode you want to run.
# Required (PR mode)
export SOURCE_PR_URL="https://github.com/owner/repo/pull/123"
export TARGET_PR_URL="https://github.com/owner/repo-cn/pull/456"
export GITHUB_TOKEN="your_github_token"
export TARGET_REPO_PATH="/path/to/target/repo"
# AI Provider (choose one)
export AI_PROVIDER="deepseek" # or "gemini"
export DEEPSEEK_API_TOKEN="your_deepseek_token" # if using DeepSeek
# OR
export GEMINI_API_TOKEN="your_gemini_token" # if using Gemini
# Optional: Glossary for consistent term translation
export TERMS_PATH="/path/to/terms.md" # auto-detected from TARGET_REPO_PATH if not set
# Optional: Token limits
export MAX_NON_SYSTEM_SECTIONS_FOR_AI=120
export SOURCE_TOKEN_LIMIT=5000
export AI_MAX_TOKENS=20000
# Optional: File-level parallelism
export DIFF_PARALLEL_FILE_THRESHOLD=6 # parallelize when changed file count is greater than this
export DIFF_PARALLEL_WORKERS=4

# Required (commit-based mode)
export SOURCE_REPO="owner/repo"
export TARGET_REPO="owner/repo-cn"
export GITHUB_TOKEN="your_github_token"
export TARGET_REPO_PATH="/path/to/target/repo"
export SOURCE_BASE_REF="abc123"
export SOURCE_HEAD_REF="def456"
# Optional: source branch label for logs / workflow context
export SOURCE_BRANCH="main"
# Optional: limit sync scope to a folder or explicit files
export SOURCE_FOLDER="ai"
export SOURCE_FILES="ai/foo.md,ai/bar.md"
# AI Provider and glossary
export AI_PROVIDER="deepseek" # or "gemini"
export TERMS_PATH="/path/to/terms.md"

commit_sync_workflow.py always uses the explicit SOURCE_BASE_REF -> SOURCE_HEAD_REF compare range passed in by the caller.
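
Before running either mode, it can help to check that the variables above are set. The following is a minimal sketch, not the tool's own validation: the variable names come from the examples above, while the grouping into required sets, the treatment of AI_PROVIDER as required, and the helper names are assumptions.

```python
from typing import List, Mapping

# Variable names are taken from the examples above; the grouping is an assumption.
REQUIRED = {
    "pr": ["SOURCE_PR_URL", "TARGET_PR_URL", "GITHUB_TOKEN", "TARGET_REPO_PATH"],
    "commit": ["SOURCE_REPO", "TARGET_REPO", "GITHUB_TOKEN", "TARGET_REPO_PATH",
               "SOURCE_BASE_REF", "SOURCE_HEAD_REF"],
}
PROVIDER_TOKENS = {"deepseek": "DEEPSEEK_API_TOKEN", "gemini": "GEMINI_API_TOKEN"}


def missing_variables(env: Mapping[str, str], mode: str) -> List[str]:
    """Return the variables that are unset or empty for the given mode."""
    required = REQUIRED[mode] + ["AI_PROVIDER"]
    token_name = PROVIDER_TOKENS.get(env.get("AI_PROVIDER", "").lower())
    if token_name:
        required.append(token_name)
    return [name for name in required if not env.get(name)]


def should_parallelize(changed_file_count: int, env: Mapping[str, str]) -> bool:
    """Parallelize only when the changed file count is greater than the threshold."""
    # Assumption: fall back to 6, the value used in the example above.
    threshold = int(env.get("DIFF_PARALLEL_FILE_THRESHOLD", "6"))
    return changed_file_count > threshold


example_env = {
    "SOURCE_REPO": "owner/repo",
    "TARGET_REPO": "owner/repo-cn",
    "GITHUB_TOKEN": "your_github_token",
    "TARGET_REPO_PATH": "/path/to/target/repo",
    "SOURCE_BASE_REF": "abc123",
    "AI_PROVIDER": "deepseek",
    "DEEPSEEK_API_TOKEN": "your_deepseek_token",
}
print(missing_variables(example_env, "commit"))  # SOURCE_HEAD_REF is not set here
```

In a real run you would pass `os.environ` instead of `example_env`. Taking the environment as a parameter keeps the check easy to test.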
# PR-based sync
cd scripts
python main_workflow.py

# Commit-based sync
cd scripts
python commit_sync_workflow.py

For local verification with explicit source commits, you can also edit scripts/commit_sync_workflow_local.py and set SOURCE_BASE_REF / SOURCE_HEAD_REF directly before running:

cd scripts
python commit_sync_workflow_local.py

Create a workflow file (.github/workflows/sync-docs.yml):

name: Sync Documentation
on:
  pull_request:
    types: [opened, synchronize]
    paths:
      - '**.md'
jobs:
  sync:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'
      - name: Install dependencies
        run: |
          cd scripts
          pip install -r requirements.txt
      - name: Run sync
        env:
          SOURCE_PR_URL: ${{ github.event.pull_request.html_url }}
          TARGET_PR_URL: ${{ secrets.TARGET_PR_URL }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          DEEPSEEK_API_TOKEN: ${{ secrets.DEEPSEEK_API_TOKEN }}
          TARGET_REPO_PATH: ${{ github.workspace }}
          AI_PROVIDER: deepseek
        run: |
          cd scripts
          python main_workflow.py

Use commit_sync_workflow.py when you want another workflow to compute a source commit range and pass it in explicitly, for example after reading a cursor file such as latest_translation_commit.json in the target repository and comparing it to the current source branch HEAD.
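
A caller workflow that follows this pattern might derive the range roughly as sketched below. The schema of latest_translation_commit.json is not documented here, so the `sha` key and the `next_sync_env` helper are hypothetical; only the SOURCE_BASE_REF and SOURCE_HEAD_REF variable names come from this README.

```python
import json
from typing import Dict, Optional


def next_sync_env(cursor_json: str, source_head_sha: str) -> Optional[Dict[str, str]]:
    """Return the env vars for the next commit sync, or None if already up to date.

    Assumption: the cursor file stores the last translated commit under "sha".
    """
    cursor = json.loads(cursor_json)
    last_translated = cursor["sha"]  # hypothetical key
    if last_translated == source_head_sha:
        return None  # nothing new on the source branch
    return {
        "SOURCE_BASE_REF": last_translated,
        "SOURCE_HEAD_REF": source_head_sha,
    }


print(next_sync_env('{"sha": "abc123"}', "def456"))
```

The caller would export the returned values before invoking commit_sync_workflow.py, and update the cursor file itself once the target PR is created.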
scripts/
├── main_workflow.py # PR-based orchestration entry point
├── commit_sync_workflow.py # Commit-based orchestration entry point for scheduled sync
├── diff_analyzer.py # Shared diff analysis for PR and commit workflows
├── section_matcher.py # Section matching (direct + AI fuzzy matching)
├── glossary.py # Glossary loading, term matching, and prompt formatting
├── file_adder.py # New file processing and translation
├── file_deleter.py # Deleted file processing
├── file_updater.py # Modified section processing and translation
├── toc_processor.py # Special TOC file handling
└── __init__.py # Package initialization
graph TD
A[Start] --> B{"Source Diff Type"}
B -->|PR| C[Analyze Source PR]
B -->|Commit Range| D[Analyze Commit Compare]
C --> E[Categorize Changes]
D --> E
E --> F{File Operation Type}
F -->|Added| G[Translate New Files]
F -->|Deleted| H[Remove Target Files]
F -->|Modified| I[Match Sections]
F -->|TOC| J[Process TOC Specially]
I --> K[AI Translation]
G --> K
J --> K
K --> L[Update Target Files]
L --> M[Create or Update Target PR]
M --> N[End]
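
The branching step in the flow above can be sketched as follows. The sketch assumes a file list shaped like the GitHub REST API response for pull request files or commit comparisons, where each entry carries a `filename` and a `status`. Treating any file whose name starts with "TOC" as a TOC file is an assumption, and `categorize_changes` is a hypothetical helper rather than the code in diff_analyzer.py.

```python
from typing import Dict, Iterable, List, Mapping


def categorize_changes(files: Iterable[Mapping[str, str]]) -> Dict[str, List[str]]:
    """Group changed Markdown files by the operation the workflow would apply."""
    buckets: Dict[str, List[str]] = {"added": [], "deleted": [], "modified": [], "toc": []}
    for entry in files:
        name, status = entry["filename"], entry["status"]
        if not name.endswith(".md"):
            continue  # only Markdown documents are translated
        basename = name.rsplit("/", 1)[-1]
        if status == "removed":
            buckets["deleted"].append(name)
        elif basename.startswith("TOC"):  # assumption: TOC files are named TOC*.md
            buckets["toc"].append(name)
        elif status == "added":
            buckets["added"].append(name)
        elif status == "modified":
            buckets["modified"].append(name)
        # Other statuses such as "renamed" are left out of this sketch.
    return buckets


changes = categorize_changes([
    {"filename": "ai/overview.md", "status": "modified"},
    {"filename": "ai/new-guide.md", "status": "added"},
    {"filename": "ai/old-guide.md", "status": "removed"},
    {"filename": "TOC-ai.md", "status": "modified"},
    {"filename": "scripts/build.sh", "status": "modified"},
])
print(changes)
```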
- Diff Analysis (diff_analyzer.py)
  - Fetches a PR diff or a commit compare from GitHub to identify only what changed
  - Parses markdown files and builds document hierarchy
  - Categorizes changes by operation type (added/modified/deleted)
  - Extracts section content and metadata
  - Benefit: Eliminates unnecessary translation of unchanged content

  In commit-based mode, commit_sync_workflow.py only consumes the explicit compare range it is given. If you use a cursor file such as latest_translation_commit.json, that file should be managed by the caller workflow (for example in the target repository), not by this repo.

- Section Matching (section_matcher.py)
  - Direct matching for identical hierarchies
  - AI-powered matching for restructured sections
  - System variable detection and exact matching
  - Confidence scoring for match quality
  - Benefit: Precisely identifies which target sections need updates, protecting untouched content

- AI Translation (file_updater.py, file_adder.py)
  - Generates contextual prompts with source diff AND existing target translations
  - Provides AI with reference translations for consistency
  - Calls AI provider API (DeepSeek or Gemini)
  - Token usage tracking and optimization
  - Batch processing for large files
  - Benefit: AI learns from existing translations, ensuring terminology consistency and reducing costs

- Target Update (file_updater.py)
  - Applies translated content only to matched sections
  - Preserves formatting and structure
  - Handles line-level updates for modified sections
  - Leaves unmatched sections completely untouched
  - Creates new files or removes deleted ones
  - Benefit: Non-destructive updates with surgical precision
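
The hierarchy-building and non-destructive update steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the logic in diff_analyzer.py or file_updater.py: sections are keyed by a heading path such as "Guide > Install", and `split_sections` and `apply_updates` are hypothetical names.

```python
import re
from typing import Dict, List, Tuple

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # code fence marker; headings inside fences are ignored


def split_sections(markdown: str) -> List[Tuple[str, List[str]]]:
    """Split a document into (heading_path, lines) pairs in document order."""
    sections: List[Tuple[str, List[str]]] = [("", [])]  # text before the first heading
    stack: List[Tuple[int, str]] = []  # (level, title) of the enclosing headings
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            level = len(match.group(1))
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
            sections.append((" > ".join(title for _, title in stack), []))
        sections[-1][1].append(line)
    return sections


def apply_updates(target_markdown: str, updates: Dict[str, str]) -> str:
    """Replace only the sections named in `updates`; keep every other line as-is."""
    output: List[str] = []
    for path, lines in split_sections(target_markdown):
        if path in updates:
            output.append(updates[path].rstrip("\n"))
        else:
            output.extend(lines)
    return "\n".join(output)


target = "# Guide\nIntro.\n## Install\nOld steps.\n## Config\nKeep me."
updated = apply_updates(target, {"Guide > Install": "## Install\nNew steps."})
print(updated)
```

Because unmatched sections are copied line by line, anything the source diff does not mention reaches the output unchanged.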
Control costs by setting limits:
MAX_NON_SYSTEM_SECTIONS_FOR_AI = 120 # Max sections per file
SOURCE_TOKEN_LIMIT = 5000 # Max tokens for source content
AI_MAX_TOKENS = 20000 # Max tokens per AI request

Customize handling for specific files and folders in scripts/workflow_ignore_config.json:
{
"PR_MODE_IGNORE_FILES": [
"TOC-tidb-cloud.md",
"TOC-ai.md"
],
"PR_MODE_IGNORE_FOLDERS": [
"tidb-cloud",
"ai"
],
"COMMIT_BASED_MODE_IGNORE_FILES": [],
"COMMIT_BASED_MODE_IGNORE_FOLDERS": []
}

Set VERBOSE_WORKFLOW_LOGS=true to print full AI prompts, AI responses, and
per-section update details when you need deep debugging. It is disabled by
default to keep GitHub Actions logs compact.
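
Returning to the ignore configuration in scripts/workflow_ignore_config.json, the sketch below shows one plausible way such a file could be applied. The key names come from the example configuration. The matching rules (exact path for files, path prefix for folders) and the `is_ignored` helper are assumptions, since the README does not specify them.

```python
import json
from typing import List, Mapping

IGNORE_CONFIG = json.loads("""
{
  "PR_MODE_IGNORE_FILES": ["TOC-tidb-cloud.md", "TOC-ai.md"],
  "PR_MODE_IGNORE_FOLDERS": ["tidb-cloud", "ai"],
  "COMMIT_BASED_MODE_IGNORE_FILES": [],
  "COMMIT_BASED_MODE_IGNORE_FOLDERS": []
}
""")


def is_ignored(path: str, config: Mapping[str, List[str]], mode: str) -> bool:
    """Return True if `path` should be skipped in the given mode ("pr" or "commit")."""
    prefix = "PR_MODE" if mode == "pr" else "COMMIT_BASED_MODE"
    files = config.get(prefix + "_IGNORE_FILES", [])
    folders = config.get(prefix + "_IGNORE_FOLDERS", [])
    if path in files:  # assumption: files are matched by exact repository path
        return True
    # Assumption: a folder entry ignores everything beneath that path prefix.
    return any(path.startswith(folder.rstrip("/") + "/") for folder in folders)


print(is_ignored("ai/foo.md", IGNORE_CONFIG, "pr"))
print(is_ignored("ai/foo.md", IGNORE_CONFIG, "commit"))
```

With the example configuration, the same file is skipped in PR mode but processed in commit-based mode, which matches the idea of a scheduled folder sync owning those paths.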
- Direct Matching: Exact hierarchy and title matching
- Normalized Matching: Title normalization for minor variations
- AI Fuzzy Matching: LLM-powered matching for complex restructures
- System Variable Matching: Special rules for configuration items
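
The first two tiers of this strategy can be sketched as a simple fallback chain. The sketch assumes it is comparing heading paths that stay the same across languages, such as configuration item names. The normalization rules and the `match_section` helper are hypothetical, and the AI tier is only signalled, not implemented.

```python
import re
import unicodedata
from typing import List, Optional, Tuple


def normalize_title(title: str) -> str:
    """Lowercase, drop inline markup, and collapse punctuation and whitespace."""
    text = unicodedata.normalize("NFKC", title).lower()
    text = re.sub(r"[`*]", "", text)      # inline code and emphasis markers
    text = re.sub(r"[^\w\s]", " ", text)  # remaining punctuation becomes a space
    return " ".join(text.split())


def match_section(source_path: str, target_paths: List[str]) -> Tuple[Optional[str], str]:
    """Return (matched_target_path, strategy) using direct, then normalized matching."""
    if source_path in target_paths:
        return source_path, "direct"
    wanted = normalize_title(source_path)
    for candidate in target_paths:
        if normalize_title(candidate) == wanted:
            return candidate, "normalized"
    return None, "needs_ai"  # a real workflow would fall back to AI fuzzy matching


# Hypothetical heading paths for illustration.
targets = ["Settings > `max_workers`", "Settings > Log Level"]
print(match_section("Settings > `max_workers`", targets))
print(match_section("Settings > log-level", targets))
print(match_section("Settings > Timeouts", targets))
```

Trying the cheap strategies first means the AI call is reserved for sections that were actually restructured or renamed.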
The tool generates debug files in temp_output/:
temp_output/
├── {file}-source-diff-dict.json # Source changes
├── {file}-match_source_diff_to_target.json # Section matching results
├── {file}-ai-prompt.txt # AI translation prompts
└── {file}-ai-response.txt # AI translation responses
A typical scenario: a 1000-line document receives a 5-line change. Full re-translation spends tokens on all 1000 lines. This tool sends only the 5 changed lines plus their surrounding context, which can cut translation costs by more than 95%, depending on how much context is included.
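
As a rough check on that figure, the arithmetic can be written out. The sketch uses line counts as a stand-in for tokens, and the 40 lines of surrounding context are an assumed value, not something the tool guarantees.

```python
def estimated_savings(total_lines: int, changed_lines: int, context_lines: int) -> float:
    """Fraction of the document that does not need to be sent for translation."""
    sent = min(total_lines, changed_lines + context_lines)
    return 1 - sent / total_lines


# 5 changed lines plus an assumed 40 lines of context in a 1000-line document.
savings = estimated_savings(1000, 5, 40)
print("{:.1%}".format(savings))
```

The saving shrinks as the context window grows, so the 95% figure holds only while changed lines plus context stay under about 5% of the document.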
When updating technical documentation, the tool provides AI with your existing translations. If you previously translated "同步" as "replicate" in English, the AI will maintain that term instead of using variations like "synchronize", ensuring consistency.
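
A prompt that carries this context might be assembled as in the sketch below. The actual prompt used by file_updater.py is not shown in this README, so the wording, the section labels, and the `build_translation_prompt` helper are assumptions made for illustration.

```python
from typing import Mapping


def build_translation_prompt(source_diff: str, existing_translation: str,
                             glossary: Mapping[str, str]) -> str:
    """Combine the source change, the current translation, and matched terms."""
    term_lines = "\n".join(
        "- {} => {}".format(source, target) for source, target in glossary.items()
    )
    return (
        "Update the existing translation so that it reflects the source change.\n"
        "Keep the wording of unchanged sentences and reuse the listed terms.\n\n"
        "## Glossary\n" + term_lines + "\n\n"
        "## Source diff\n" + source_diff + "\n\n"
        "## Existing translation\n" + existing_translation + "\n"
    )


prompt = build_translation_prompt(
    source_diff="-该工具同步数据。\n+该工具同步增量数据。",
    existing_translation="The tool replicates data.",
    glossary={"同步": "replicate"},
)
print(prompt)
```

Including the existing sentence gives the model a concrete precedent for the established term, so it has less reason to drift toward a synonym.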
- Documentation Internationalization: Maintain English and Chinese versions of technical docs with incremental updates
- Cross-Repository Sync: Keep documentation in sync across multiple repos without re-translating unchanged content
- Translation Quality Assurance: Review only the changed sections before merging, not entire documents
- Large-Scale Documentation: Handle repositories with thousands of markdown files efficiently by translating only incremental PR or commit changes
- Scheduled Folder Sync: Periodically translate directories such as docs/ai from the source repo and automatically open a target-language PR
This project is licensed under the MIT License - see the LICENSE file for details.