A high-performance, concurrent web scraper for books.toscrape.com built with Python. Features robust error handling, rate limiting, and detailed logging while respecting the website's resources.
- 🚀 Concurrent scraping using ThreadPoolExecutor (see the sketch after this list)
- 📊 Automatic CSV export
- 🔄 Smart retry mechanism
- ⏱️ Rate limiting to respect server resources
- 📝 Comprehensive logging
- 🛡️ Robust error handling
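A minimal sketch of how the concurrent fetching and retry features can fit together; the function names and linear backoff below are illustrative assumptions, not the exact code in `scraper.py`:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

MAX_RETRIES = 3
TIMEOUT = 10  # seconds

def fetch_page(session, url):
    """Fetch one URL, retrying transient failures with a linear backoff."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            response = session.get(url, timeout=TIMEOUT)
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            if attempt == MAX_RETRIES:
                raise
            time.sleep(attempt)  # back off a little longer each time

def scrape_pages(urls, max_workers=5):
    """Fetch many pages concurrently over one shared session."""
    session = requests.Session()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch_page, session, url) for url in urls]
        return [future.result() for future in as_completed(futures)]
```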
To run it you'll need:

- Python 3.7 or higher
- pip (Python package manager)
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/books-scraper.git
  cd books-scraper
  ```

- Create a virtual environment (recommended):

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
Basic usage:

```bash
python scraper.py
```

With custom parameters:

```bash
python scraper.py --pages 100 --workers 5 --output books.csv
```
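These flags suggest an `argparse` interface; a minimal sketch, assuming the flag names from the example above (the defaults here are guesses, not confirmed values from `scraper.py`):

```python
import argparse

def parse_args():
    """Parse the command-line options shown in the usage examples."""
    parser = argparse.ArgumentParser(description="Scrape books.toscrape.com")
    parser.add_argument("--pages", type=int, default=50,
                        help="Number of catalogue pages to scrape")
    parser.add_argument("--workers", type=int, default=5,
                        help="Number of concurrent worker threads")
    parser.add_argument("--output", default="books.csv",
                        help="Path of the output CSV file")
    return parser.parse_args()
```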
Key settings in `scraper.py`:

```python
MAX_RETRIES = 3            # Maximum retry attempts
TIMEOUT = 10               # Request timeout in seconds
MAX_WORKERS = 5            # Concurrent threads
RATE_LIMIT = 1             # Seconds between requests
OUTPUT_FILE = "books.csv"  # Default output path
```
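Because `RATE_LIMIT` is shared by all `MAX_WORKERS` threads, spacing requests globally takes a little coordination. One way to do it is a lock-guarded limiter like this sketch (not necessarily how `scraper.py` enforces it):

```python
import threading
import time

class RateLimiter:
    """Enforce a minimum interval between requests across all threads."""

    def __init__(self, interval):
        self.interval = interval
        self.lock = threading.Lock()
        self.last_request = 0.0

    def wait(self):
        # Holding the lock while sleeping intentionally serializes callers,
        # which is exactly what a global rate limit requires.
        with self.lock:
            elapsed = time.monotonic() - self.last_request
            if elapsed < self.interval:
                time.sleep(self.interval - elapsed)
            self.last_request = time.monotonic()
```

Each worker calls `wait()` immediately before its request, so requests stay at least `RATE_LIMIT` seconds apart regardless of thread count.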
The scraper saves data in CSV format with the following columns:
| Column | Description |
|---|---|
| title | Book title |
| price | Price in £ |
| availability | Stock status |
| rating | Star rating (1-5) |
| url | Book detail page URL |
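For illustration, rows in this shape can be collected with BeautifulSoup and written via `csv.DictWriter`. The selectors below match books.toscrape.com's catalogue markup as of writing and may need adjusting, and the helper names are hypothetical:

```python
import csv
from urllib.parse import urljoin

from bs4 import BeautifulSoup

RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def parse_books(html, base_url):
    """Yield one row dict per book on a catalogue page."""
    soup = BeautifulSoup(html, "html.parser")
    for article in soup.select("article.product_pod"):
        link = article.select_one("h3 a")
        classes = article.select_one("p.star-rating")["class"]
        rating = next((RATING_WORDS[c] for c in classes if c in RATING_WORDS), None)
        yield {
            "title": link["title"],
            "price": article.select_one("p.price_color").get_text(strip=True),
            "availability": article.select_one("p.availability").get_text(strip=True),
            "rating": rating,
            "url": urljoin(base_url, link["href"]),
        }

def write_csv(rows, path):
    """Write rows to CSV with the columns documented above."""
    fieldnames = ["title", "price", "availability", "rating", "url"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```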
- Detailed logs saved to `scraper.log`
- Console output for progress tracking
- Error tracking and reporting
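Both destinations can come from a single standard-library `logging` setup; a minimal sketch (the format string is an assumption):

```python
import logging

def setup_logging():
    """Send logs to scraper.log and mirror them to the console."""
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
        handlers=[
            logging.FileHandler("scraper.log"),
            logging.StreamHandler(),
        ],
    )
```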
- Automatic retry for failed requests
- Rate limit detection and handling
- Malformed HTML protection
- Network error recovery
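Rate-limit detection usually means watching for HTTP 429 responses. A simple sketch that backs off once, assuming a numeric `Retry-After` header (the actual handling in `scraper.py` may differ):

```python
import time

import requests

def get_with_backoff(session, url, timeout=10):
    """Retry once after a 429, honouring a numeric Retry-After header."""
    response = session.get(url, timeout=timeout)
    if response.status_code == 429:
        # Assumes Retry-After carries seconds; it can also be an HTTP date.
        delay = float(response.headers.get("Retry-After", 5))
        time.sleep(delay)
        response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response
```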
- Concurrent page processing
- Connection pooling
- Session reuse
- Optimized memory usage
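In `requests`, connection pooling and session reuse both follow from sharing one `Session`, optionally mounted with an `HTTPAdapter` sized to the worker count; a minimal sketch:

```python
import requests
from requests.adapters import HTTPAdapter

def make_session(pool_size=5):
    """Build one shared Session whose pool matches the worker count."""
    session = requests.Session()
    adapter = HTTPAdapter(pool_connections=pool_size, pool_maxsize=pool_size)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```

Reusing the session lets workers recycle TCP connections instead of paying a handshake for every request.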
```text
├── scraper.py        # Main scraper code
├── requirements.txt  # Python dependencies
├── README.md         # Documentation
└── .gitignore        # Git ignore file
```
To contribute:

- Fork the repository
- Create a feature branch
- Make your changes
- Run the tests
- Submit a pull request
The scraper's dependencies, as listed in `requirements.txt`:

```text
requests>=2.26.0
beautifulsoup4>=4.9.3
urllib3>=1.26.7
```
Common issues and solutions:

- Rate limiting:
  - Increase the `RATE_LIMIT` value
  - Reduce `MAX_WORKERS`
- Connection errors:
  - Check your internet connection
  - Verify that the website is reachable
  - Increase the `TIMEOUT` value
Thanks to:

- Books to Scrape for providing the test website
- The BeautifulSoup4 developers for the excellent parsing library
- The Python community for the concurrent.futures framework
Note: This scraper is for educational purposes only. Always check and respect a website's robots.txt and terms of service before scraping.