Selenium Scraper Quickstarter is a professional template for building robust and scalable web scrapers using Selenium and Flask, ready for local development, Docker containers, and cloud deployment.
- Selenium Automation: Advanced interaction with dynamic web pages.
- RESTful API with Flask: Secure and customizable endpoint exposure.
- Bearer Authentication: Security via configurable tokens.
- Environment Management: Environment variables for production, testing, and staging.
- Docker & Codespaces Support: Ready for containers and cloud development.
- Logging System: Activity and error logging for auditing and debugging.
- Automated Downloads: File and temporary directory management.
- Extensible & Modular: Clean architecture for easily adding actions and endpoints.
├── main.py # Flask entry point
├── actions/ # Scraping and automation logic
├── controller/ # Endpoint controllers
├── temp_downloads/ # Temporary downloads
├── utils/ # Utilities and configuration
├── requirements.txt # Python dependencies
├── Dockerfile # Production-ready Docker image
├── .env.example # Example environment configuration
└── README.md # This file
Configure scraper behavior via variables in the `.env` file. Copy `.env.example` to `.env` and customize as needed.
Variable | Required | Possible Values / Example | Description |
---|---|---|---|
`STAGE` | Yes | `production`, `testing`, `staging` | Execution environment (affects visibility and real actions) |
`VALID_TOKEN` | Yes | `sample` | Bearer token to authenticate requests |
`HEADLESS_MODE` | Optional | `auto`, `True`, `False` | Controls whether the browser is visible or headless |
`AUTO_DELETE_LOGS` | Optional | `True`, `False` | Automatically deletes old logs |
Note: See `.env.example` for more details and recommendations.
Base URL: The base URL is now set in the constant `BASE_URL` inside `utils/config.py`. To change the target site, edit the value of `BASE_URL` in that file.
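As an illustration of how the variables in the table could be read and validated at startup, here is a minimal sketch. It is not the project's actual `utils/config.py`: the function name `load_settings`, the returned keys, the default for `AUTO_DELETE_LOGS`, and above all the meaning given to `HEADLESS_MODE=auto` are assumptions made for this example.

```python
import os

# Allowed values, taken from the configuration table.
VALID_STAGES = {"production", "testing", "staging"}


def load_settings(env=None):
    """Read and validate scraper settings from an environment mapping.

    Hypothetical helper: the real project may parse these differently.
    """
    env = os.environ if env is None else env

    stage = env.get("STAGE", "").strip().lower()
    if stage not in VALID_STAGES:
        raise ValueError(f"STAGE must be one of {sorted(VALID_STAGES)}, got {stage!r}")

    token = env.get("VALID_TOKEN", "")
    if not token:
        raise ValueError("VALID_TOKEN is required")

    headless_raw = env.get("HEADLESS_MODE", "auto").strip().lower()
    if headless_raw == "auto":
        # ASSUMPTION: "auto" hides the browser only in production.
        headless = stage == "production"
    elif headless_raw in ("true", "false"):
        headless = headless_raw == "true"
    else:
        raise ValueError(f"HEADLESS_MODE must be auto, True or False, got {headless_raw!r}")

    # ASSUMPTION: log deletion is off unless explicitly enabled.
    auto_delete_logs = env.get("AUTO_DELETE_LOGS", "False").strip().lower() == "true"

    return {
        "stage": stage,
        "token": token,
        "headless": headless,
        "auto_delete_logs": auto_delete_logs,
    }
```

Passing a plain dictionary instead of `os.environ` makes the parsing easy to test, for example `load_settings({"STAGE": "testing", "VALID_TOKEN": "sample"})`.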
git clone https://github.com/Ismola/selenium-scraper-quickstarter.git
cd selenium-scraper-quickstarter
- Copy `.env.example` to `.env` and edit it as needed.
- Make sure you have Python 3.x and Google Chrome installed.
- Set the base URL: Edit the `BASE_URL` constant in `utils/config.py` to point to your target website.
- Install VS Code and the Dev Containers extension.
- Install Docker.
- Open the project in VS Code and select "Reopen in Container".
- Click "Code" > "Open with Codespaces" on GitHub.
- Wait for the environment to be set up automatically.
python3 -m venv venv
source venv/bin/activate # or .\venv\Scripts\activate on Windows
pip install -r requirements.txt
python3 main.py
docker build -t selenium-scraper .
docker run --env-file .env -p 3000:3000 selenium-scraper
gunicorn -w 2 -b 0.0.0.0:3000 --timeout 600 main:app
- Change environment variables for CI/CD:
  - Go to Settings > Secrets and variables > Actions in your GitHub repository.
  - Add or update secrets like `STAGE`, `VALID_TOKEN` as needed.
  - These will be injected into the Docker image during the build and publish process.
⚠️ Important: If you publish the Docker image to a public registry, any environment variable (such as `STAGE`, `VALID_TOKEN`) injected during build may be visible to anyone who downloads the image. Never use production secrets or sensitive tokens in public images. For private deployments, always use private registries and restrict access to your images.
- Change Docker registry or image name:
  - Edit the `REGISTRY` and `IMAGE_NAME` variables in `.github/workflows/docker-publish.yml`.
- Trigger the publish workflow:
  - By default, the Docker image is published only after a successful run of the `Test` workflow.
  - You can change the trigger to run on other branches or events by editing the `on:` section.
Tip: Adjust the port if you change the exposed port in `compose.yaml`.
All protected routes require the header:
Authorization: Bearer <VALID_TOKEN>
Method | Route | Description |
---|---|---|
GET | `/` | Server health check |
GET | `/sample` | Example endpoint (modifiable) |
curl -H "Authorization: Bearer sample" http://localhost:3000/sample
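To make the header check behind the protected routes concrete, the following sketch shows one way a `Authorization: Bearer <VALID_TOKEN>` value could be validated. The helper name `is_authorized` is hypothetical and the project's own check may be written differently; only the header format comes from this README.

```python
import hmac


def is_authorized(authorization_header, valid_token):
    """Return True only when the header is 'Bearer <valid_token>'.

    Hypothetical helper, not taken from the project source.
    """
    if not authorization_header or not valid_token:
        return False

    # Split "Bearer sample" into the scheme and the token.
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False

    # compare_digest avoids leaking the token through timing differences.
    return hmac.compare_digest(token.strip().encode(), valid_token.encode())
```

In a Flask view the header value would typically come from `request.headers.get("Authorization")`, and a `False` result would map to a 401 response.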
- Add your token in `.env`.
- Set the base URL in `utils/config.py` by editing the `BASE_URL` constant.
- Create new endpoints in `main.py`.
- Implement scraping logic in `actions/` and controllers in `controller/`.
- Use utilities from `utils/` for logging, configuration, and helpers.
- main.py: Defines endpoints and starts Flask.
- controller/: Receives the request, validates, and calls the action.
- actions/: Executes scraping logic (Selenium).
- utils/: Configuration, helpers, and shared utilities.
  - `BASE_URL` is defined in `utils/config.py`.
- temp_downloads/: Stores temporarily downloaded files.
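The split between controller and action described above can be sketched without Flask or a browser. Everything named here (`sample_action`, `sample_controller`, the `q` parameter, the shape of the result) is hypothetical and only illustrates the direction of the calls: the endpoint hands the request data to a controller, which validates it and delegates to an action.

```python
def sample_action(query):
    """Action layer: in the real project this is where Selenium would run."""
    # Placeholder result so the sketch runs without a browser.
    return {"query": query, "items": []}


def sample_controller(params):
    """Controller layer: validate input, call the action, return (body, status)."""
    query = (params.get("q") or "").strip()
    if not query:
        return {"error": "missing 'q' parameter"}, 400
    return sample_action(query), 200


if __name__ == "__main__":
    # An endpoint in main.py would pass the request parameters in the same way.
    print(sample_controller({"q": "shoes"}))
    print(sample_controller({}))
```

Keeping validation in the controller means an action can assume clean input and stay focused on browser automation.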
There are several recommended ways to deploy your scraper in a production or staging environment:
The project is ready to be served using Gunicorn, a robust WSGI HTTP server for Python web applications. This is the method used in the provided Dockerfile.
To run with Gunicorn manually:
gunicorn -w 2 -b 0.0.0.0:3000 --timeout 600 main:app
- `-w 2`: Number of worker processes (adjust as needed).
- `-b 0.0.0.0:3000`: Binds to all interfaces on port 3000.
- `--timeout 600`: Raises the worker timeout to 600 seconds for long scraping tasks.
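The same three flags can also be kept in a Gunicorn configuration file, which is a plain Python module. The file below is a hypothetical addition, not part of the repository, and would be loaded with `gunicorn -c gunicorn.conf.py main:app`.

```python
# gunicorn.conf.py (hypothetical file; the repository passes these as CLI flags)

bind = "0.0.0.0:3000"  # same as -b 0.0.0.0:3000
workers = 2            # same as -w 2
timeout = 600          # same as --timeout 600, in seconds
```

A config file keeps the command line short and lets the values be reviewed and versioned alongside the code.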
You can deploy the application using Docker, ensuring all dependencies and environment settings are consistent across environments.
Build and run the container:
docker build -t selenium-scraper .
docker run --env-file .env -p 3000:3000 selenium-scraper
The repository includes a `compose.yaml` file for Docker Compose, which simplifies running the application with persistent storage for logs and downloaded files.
To deploy with Docker Compose:
docker compose up --build
- `./logs:/app/logs`: Persists application logs on your host machine for easier debugging and auditing.
- `./temp_downloads:/app/temp_downloads`: Stores downloaded files outside the container, so you don't lose data on container restarts.
Tip: You can customize the exposed ports and volume paths in `compose.yaml` as needed for your infrastructure.
This project includes preconfigured GitHub Actions workflows for continuous integration and automated Docker image publishing. You can find the workflow files in `.github/workflows/`.
- Docker Smoke (`docker-smoke.yml`):
  - Builds the Docker image and verifies that the application starts correctly using Docker Compose.
  - Performs a smoke test by accessing the root endpoint (`/`) to ensure the container responds.
- CI Test Suite (`ci-test.yml`):
  - Runs after the smoke test.
  - Installs dependencies and runs automated tests with `pytest` inside the Docker Compose environment.
  - Ensures the application passes tests before continuing the pipeline.
- Docker Build & Publish (`docker-publish.yml`):
  - Runs only if the previous workflows are successful.
  - Builds and publishes the Docker image to GitHub Container Registry (`ghcr.io`).
  - Uses environment variables and secrets configured in the repository.
⚙️ You can customize or extend these workflows by editing the files in `.github/workflows/` as needed for your CI/CD requirements.
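The smoke test performed by the first workflow (request `/` and confirm the container answers) can be reproduced locally with a short polling script. This is a sketch under assumptions: the function name `wait_until_up`, the retry counts, and the use of `urllib` are choices made for this example and are not taken from `docker-smoke.yml`.

```python
import time
import urllib.error
import urllib.request


def wait_until_up(url, attempts=10, delay=1.0):
    """Poll `url` until it answers with HTTP 200 or the attempts run out."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Server not reachable yet or returned an error status.
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


if __name__ == "__main__":
    # Port 3000 matches the port used throughout this README.
    ok = wait_until_up("http://localhost:3000/")
    raise SystemExit(0 if ok else 1)
```

Retrying matters because the container usually needs a few seconds after `docker compose up` before the server starts accepting connections.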
The project includes automated tests located in the `test/` folder, using `pytest` and Flask's test client.
- `test/test_main.py`: Tests for the main endpoints defined in `main.py`.
- Install dependencies if you haven't already:
pip install -r requirements.txt
pip install pytest
- Run the tests:
pytest --maxfail=1 --disable-warnings -v
💡 Tip: You can also run the tests automatically in the CI/CD flow with GitHub Actions.
- Edit or add files in the `test/` folder following the example in `test_main.py`.
- Use the Flask client to simulate HTTP requests and validate responses.
- To test new endpoints, create functions starting with `test_` and use the `client` fixture.
- See the pytest documentation for more options and best practices.
This project uses custom Copilot instructions from Ismola/personal-copilot-instructions.
Each time the devcontainer starts, they are cloned and updated automatically in `.github/instructions`.
- This may be due to incompatibility between Chrome and Chromedriver.
- Quick fix:
  - Delete the drivers folder:
    rm -rf ~/.wdm
  - Restart the environment.
- Ensure environment variables are correctly set.
- Check generated logs for more details.
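For the Chrome and Chromedriver mismatch above, `rm -rf ~/.wdm` is not available in a plain Windows shell. A portable equivalent could look like the sketch below; the helper name `clear_driver_cache` is hypothetical, and the only detail taken from this README is the `~/.wdm` folder.

```python
import shutil
from pathlib import Path


def clear_driver_cache(cache_dir=None):
    """Delete the cached drivers folder; return True if something was removed."""
    # Defaults to ~/.wdm, the folder named in the quick fix.
    cache_dir = Path(cache_dir) if cache_dir else Path.home() / ".wdm"
    if not cache_dir.exists():
        return False
    shutil.rmtree(cache_dir)
    return True


if __name__ == "__main__":
    removed = clear_driver_cache()
    print("Driver cache removed" if removed else "No driver cache found")
```

After running it, restart the environment so a matching driver is fetched again.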
Pull requests and suggestions are welcome! Please open an issue to discuss major changes.
MIT License © Ismola