Bookfix is a comprehensive GUI-based text processing application designed specifically for cleaning and formatting ebook text files for text-to-speech (TTS) and other applications. It provides both automated and interactive tools to transform raw ebook text into polished, readable content.
- File Support: Currently works with .txt, .html, and .xhtml files
- Platform: Written in Python and tested on Linux OS
- Ebook Conversion: For best results, convert EPUB files to TXT using Calibre before processing
- EPUB Processing: EPUB files can be processed directly by decompressing them and handling the HTML markup, but TXT conversion is recommended for stability and ease of use
- Interactive Word Choices - Manually select replacements for specific words with keyboard shortcuts
- Automatic Replacements - Apply predefined find/replace rules from configuration
- Pagination Removal - Remove page numbers and pagination elements from HTML/TXT files
- Roman Numeral Conversion - Convert Roman numerals to Arabic numbers (II → 2, XIV → 14)
- All-Caps Sequence Processing - Interactive handling of uppercase sequences with auto-lowercase options
- Abbreviation Protection - Prevents conversion of common abbreviations (I.D., Ph.D., etc.)
- Numbered Line Editing - Manual editing interface for lines containing 3+ digit numbers
- Blank Line Removal - Clean up excessive whitespace and empty lines
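To make the Roman numeral feature concrete, here is a minimal sketch of a validating converter. It is not the code from bookfix.py: the strict 1 to 3999 pattern and the choice to return None for invalid input are assumptions made for illustration.

```python
import re

# Strict pattern for Roman numerals from 1 to 3999 (an assumption; the
# validation used by bookfix.py may differ).
_ROMAN_RE = re.compile(
    r"^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)
_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_arabic(roman):
    """Return the integer value of a Roman numeral, or None if it is invalid."""
    roman = roman.upper()
    if not roman or not _ROMAN_RE.match(roman):
        return None
    total = 0
    # Pair each symbol with its successor; a space marks the end of the string.
    for ch, nxt in zip(roman, roman[1:] + " "):
        value = _VALUES[ch]
        # Subtractive notation: a smaller value before a larger one is subtracted.
        if nxt != " " and _VALUES[nxt] > value:
            total -= value
        else:
            total += value
    return total
```

With this sketch, roman_to_arabic("II") gives 2 and roman_to_arabic("XIV") gives 14, matching the examples in the feature list, while a malformed string such as "IIII" gives None.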
- Input formats: .txt, .html, .xhtml
- Output format: .txt (with _output suffix)
- BeautifulSoup integration for HTML/XHTML processing
- Checkbox controls for enabling/disabling processing steps
- Real-time text preview with syntax highlighting
- Progress tracking with visual progress bars
- Keyboard shortcuts for faster operation
- Status updates throughout processing
- Python 3.6 or higher
- Required packages:
  - tkinter (usually included with Python)
  - beautifulsoup4
# Install required package
pip install beautifulsoup4
# Run the program
python3 bookfix.py
# Check Python version
python3 --version
# Install dependencies
pip3 install beautifulsoup4
# If tkinter is missing (rare)
brew install python-tk
# Run
python3 bookfix.py
- Launch the application: python3 bookfix.py
- Set default directory (first run only)
  - Select your ebook library or text files folder
  - This setting is saved for future use
- Select input file
  - Choose a .txt, .html, or .xhtml file to process
- Configure processing options
  - Check/uncheck desired processing steps
  - All options are enabled by default
- Start processing
  - Click "Start Processing" button
  - Follow interactive prompts as needed
- Save results
  - Click "Save" when processing is complete
  - Output saved as filename_output.txt
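The output naming in the last step can be sketched as follows. The helper name output_path_for is hypothetical, and the assumption that the output lands next to the input file is not confirmed by the text above.

```python
from pathlib import Path


def output_path_for(input_path):
    """Build the output path: same folder (assumed), stem plus '_output.txt'."""
    p = Path(input_path)
    # p.stem drops the original extension (.txt, .html or .xhtml).
    return p.with_name(p.stem + "_output.txt")
```

For an input named novel.xhtml this yields a file named novel_output.txt.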
The bookfix.py script is a GUI tool built with the Tkinter library. Its main goal is to help users clean and standardize text from input files by providing a way to handle inconsistent wording and apply automatic cleanup rules.
The program guides the user through making decisions for specific words and then performs a series of automatic text transformations based on rules read from a separate data file.
The program processes text in the following order:
- Apply Automatic Replacements - Bulk find/replace operations
- Insert Periods into Abbreviations - Add periods to specified abbreviations
- Remove Pagination - Clean up page numbers and pagination elements
- Interactive Choices - Manual word-by-word replacement decisions
- Process All-Caps Sequences - Handle uppercase text interactively
- Convert Roman Numerals - Transform Roman numerals to Arabic numbers
- Convert to Lowercase - Optional full text lowercasing
- Remove Blank Lines - Clean up excessive whitespace
- Numbered Line Editing - Manual editing of lines with numbers
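The idea of running steps in a fixed order, each gated by a checkbox, can be sketched like this. The function run_steps and the two stand-in steps are hypothetical, and the real interactive steps depend on GUI dialogs, so they would not be plain text-to-text functions as assumed here.

```python
def run_steps(text, steps, enabled):
    """Apply each enabled step to the text, in list order."""
    for name, step in steps:
        # Steps default to enabled, mirroring "all options are enabled by default".
        if enabled.get(name, True):
            text = step(text)
    return text


# Two stand-in steps for illustration only.
steps = [
    ("lowercase", str.lower),
    ("strip_blank_lines",
     lambda t: "\n".join(line for line in t.splitlines() if line.strip())),
]
```

Calling run_steps("Hello\n\nWORLD", steps, {"lowercase": False}) skips the lowercase step and only removes the blank line.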
- Press number keys (1-9) to select replacement options
- View highlighted matches in context
- Progress tracking shows completion status
- Y/Yes - Lowercase this instance and all remaining instances
- N/No - Keep uppercase, skip for this session
- A/Add - Add to ignore list permanently
- I/Auto - Add to auto-lowercase list permanently
- Edit lines containing 3+ digit numbers
- Navigate with Previous/Next buttons
- Roman numeral reference guide included
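Selecting the lines offered for numbered line editing might look roughly like the sketch below. It assumes that "3+ digit numbers" means a run of three or more consecutive digits; the helper name find_numbered_lines is hypothetical.

```python
import re

# Assumption: a "3+ digit number" is three or more consecutive digits.
_LONG_NUMBER_RE = re.compile(r"\d{3,}")


def find_numbered_lines(text):
    """Return (line_index, line) pairs for lines containing a 3+ digit number."""
    return [
        (index, line)
        for index, line in enumerate(text.splitlines())
        if _LONG_NUMBER_RE.search(line)
    ]
```

The returned indexes are zero-based, which makes it straightforward to step through them with Previous/Next style navigation.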
The program uses a .data.txt file for configuration with the following sections:
# CHOICE
word -> option1;option2;option3
# REPLACE
old_text -> new_text
# PERIODS
abbreviation_without_periods
# CAP_IGNORE
SEQUENCE_TO_IGNORE
# UPPER_TO_LOWER
SEQUENCE_TO_LOWERCASE
# DEFAULT_FILE_DIR
/path/to/your/ebook/folder
# CHOICE
colour -> color;colour
realise -> realize;realise
# REPLACE
-- -> —
... -> …
# PERIODS
Mr
Dr
St
# CAP_IGNORE
NASA
FBI
# UPPER_TO_LOWER
CHAPTER
BOOK
# DEFAULT_FILE_DIR
/Users/username/Documents/Ebooks
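A file in this format can be read with a small hand-written parser. The sketch below is not the load_data_file() implementation: the function names are hypothetical, and it assumes that every line starting with "# " is a section header and that surrounding whitespace is not significant.

```python
def parse_data_file(text):
    """Split .data.txt content into {section_name: [lines]}."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("# "):
            # Assumption: "# NAME" always opens a new section.
            current = line[2:].strip()
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line)
    return sections


def parse_rules(lines):
    """Turn 'left -> right' lines into a dict; lines without ' -> ' are skipped."""
    rules = {}
    for line in lines:
        left, sep, right = line.partition(" -> ")
        if sep:
            rules[left.strip()] = right.strip()
    return rules


def parse_choices(lines):
    """Like parse_rules, but split the right-hand side on ';' into options."""
    return {word: [opt.strip() for opt in options.split(";")]
            for word, options in parse_rules(lines).items()}
```

Applied to the example above, parse_choices(sections["CHOICE"]) would map colour to the options color and colour, and sections["PERIODS"] would hold Mr, Dr and St.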
- filename_output.txt - Main processed output
- debug.txt - Choice replacement log
- matches.txt - Detailed match processing log
- roman_conversions.log - Roman numeral conversion log
- pagination_debug.txt - Pagination removal log
- bookfix_execution.log - Complete execution log
The program writes detailed, timestamped log entries to both stderr and the execution log file for debugging and verification.
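Logging to two destinations at once can be sketched as below. The timestamp format, the default level, and the log_path parameter are assumptions for illustration; only the function name and the log file name come from this README.

```python
import sys
from datetime import datetime


def log_message(message, level="INFO", log_path="bookfix_execution.log"):
    """Write one timestamped entry to stderr and append it to the log file."""
    entry = f"{datetime.now():%Y-%m-%d %H:%M:%S} [{level}] {message}"
    # flush=True so the entry appears immediately, even if the program crashes.
    print(entry, file=sys.stderr, flush=True)
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(entry + "\n")
    return entry
```

Opening the file in append mode for each call is slower than keeping it open, but it guarantees that every entry is on disk before the next processing step starts.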
The application provides a clean, intuitive interface with:
- File selection dialog at startup for choosing input files
- Main processing window with checkbox options for each feature
- Interactive dialogs for word choice selection and processing decisions
- 1-9 - Select replacement option
- Numpad 1-9 - Select replacement option (alternative)

- Y - Yes (lowercase this and all remaining)
- N - No (keep uppercase)
- A - Add to ignore list
- I - Auto-lowercase (add to permanent list)
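The automatic part of all-caps handling, which applies the saved ignore and auto-lowercase lists before any prompt is shown, might look like the sketch below. It assumes a sequence is a single word of two or more capital letters; the real tool may group several words, and the function name auto_caps_pass is hypothetical.

```python
import re

# Assumption: an all-caps sequence is one word of 2+ ASCII capital letters.
_CAPS_RE = re.compile(r"\b[A-Z]{2,}\b")


def auto_caps_pass(text, ignore, auto_lower):
    """Lowercase sequences in auto_lower; collect the rest for interactive review."""
    pending = []

    def handle(match):
        seq = match.group(0)
        if seq in auto_lower:
            return seq.lower()
        # Sequences not covered by either list are left for the Y/N/A/I prompt.
        if seq not in ignore and seq not in pending:
            pending.append(seq)
        return seq

    return _CAPS_RE.sub(handle, text), pending
```

The pending list preserves first-occurrence order, so the interactive pass can present sequences in the order they appear in the book.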
- File not found errors
  - Check file path and permissions
  - Ensure file is not open in another program
- Missing dependencies: pip install beautifulsoup4
- GUI not appearing
  - Verify tkinter installation
  - Check Python version compatibility
- Slow processing
  - Large files may take time
  - Monitor progress bar for status
- Use forward slashes in paths: C:/Users/name/Documents
- May require additional permissions for file access
- Grant file access permissions when prompted
- Use python3 command explicitly
- Ensure display server is running for GUI
- Install tkinter if not included: sudo apt-get install python3-tk
- bookfix.py - Main application file
- Global variables for state management
- Modular functions for each processing step
- Tkinter GUI with event-driven architecture

- Add new processing steps to run_processing()
- Extend .data.txt sections for new configuration options
- Implement additional file format support
Below is an enumeration of the main functions in the application, with brief descriptions of their responsibilities.
- center_window(win)
Centers a given Tk window on screen.
- log_message(message, level)
Writes timestamped log entries to stderr and a log file, flushing immediately.
- load_data_file()
Manually parses .data.txt into sections: choices, replacements, periods, default directory, ignore, uppercase-to-lowercase.
- save_default_directory_to_data_file(dir)
Updates or creates the # DEFAULT_FILE_DIR section in .data.txt.
- save_caps_data_file(ignore, lowercase)
Updates # CAP_IGNORE and # UPPER_TO_LOWER sections in .data.txt.
- select_file()
Opens a file dialog for selecting an input file, respecting the default directory.
- process_choices()
Interactive find-and-replace according to choices rules, with progress bar.
- highlight_current_match()
Highlights the next match in the text area for user confirmation.
- handle_caps_choice(choice)
Handles user input (y/n/a/i) for all-caps sequences: lowercase now, ignore, or auto-lowercase across the document and persist rules.
- process_all_caps_sequences_gui()
Two-pass processing of all-caps sequences: automatic pass based on persistent rules, then interactive pass with buttons and keyboard shortcuts.
- apply_automatic_replacements()
Performs simple string replacements defined under # REPLACE.
- insert_periods_into_abbreviations()
Inserts dots into abbreviations defined under # PERIODS.
- convert_to_lowercase()
Converts the entire text buffer to lowercase.
- roman_to_arabic(roman)
Converts a single Roman numeral string to its integer equivalent, validating format.
- convert_roman_numerals()
Finds and replaces Roman numerals in the text with Arabic numbers, line by line.
- remove_pagination()
Detects and removes pagination elements in TXT and HTML files, logs removed items.
- remove_blank_lines(text)
Returns text with empty or whitespace-only lines removed.
- run_processing()
Orchestrates the full workflow based on checkbox states, including interactive and automatic steps, and displays the Save button.
- start_processing_button_command()
Disables the Start button, resets UI, clears old logs, and invokes run_processing().
- update_text_area()
Refreshes the displayed text to match the in-memory text variable.
- update_status_label(msg)
Updates the status label text in the GUI.
- save_file()
Saves the processed text to a new file with an _output.txt suffix.
- display_save_button()
Makes the Save button visible after processing.
- quit_program()
Exits the application cleanly.
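Two of the simpler automatic functions listed above can be sketched in a few lines. These are illustrative stand-ins, not the code from bookfix.py: the real apply_automatic_replacements() works on shared program state, whereas the hypothetical apply_replacements below takes the text and rules as arguments.

```python
def apply_replacements(text, replacements):
    """Apply plain string replacements in the order the rules were defined."""
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text


def remove_blank_lines(text):
    """Return text with empty or whitespace-only lines removed."""
    return "\n".join(line for line in text.splitlines() if line.strip())
```

Note that this version of remove_blank_lines also drops a trailing newline, and that earlier replacement rules can affect what later rules match, so rule order in the # REPLACE section matters.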
This project is open source. Feel free to modify and distribute according to your needs.
For issues or questions:
- Check the log files for detailed error information
- Verify configuration file format
- Ensure all dependencies are installed
- Test with smaller files first
Last updated: 2025-07-20