🚀 Selenium Scraper Quickstarter

Selenium Scraper Quickstarter is a professional template for building robust and scalable web scrapers using Selenium and Flask, ready for local development, Docker containers, and cloud deployment.

✨ Main Features

Selenium Automation: Advanced interaction with dynamic web pages.
RESTful API with Flask: Secure and customizable endpoint exposure.
Bearer Authentication: Security via configurable tokens.
Environment Management: Environment variables for production, testing, and staging.
Docker & Codespaces Support: Ready for containers and cloud development.
Logging System: Activity and error logging for auditing and debugging.
Automated Downloads: File and temporary directory management.
Extensible & Modular: Clean architecture for easily adding actions and endpoints.

📁 Project Structure

├── main.py                   # Flask entry point
├── actions/                  # Scraping and automation logic
├── controller/               # Endpoint controllers
├── temp_downloads/           # Temporary downloads
├── utils/                    # Utilities and configuration
├── requirements.txt          # Python dependencies
├── Dockerfile                # Production-ready Docker image
├── .env.example              # Example environment configuration
└── README.md                 # This file

⚙️ Environment Variables

Configure scraper behavior via variables in the .env file. Copy .env.example to .env and customize as needed.

| Variable | Required | Possible values / example | Description |
| --- | --- | --- | --- |
| `STAGE` | Yes | `production`, `testing`, `staging` | Execution environment (affects visibility and real actions) |
| `VALID_TOKEN` | Yes | `sample` | Bearer token to authenticate requests |
| `HEADLESS_MODE` | Optional | `auto`, `True`, `False` | Controls whether the browser is visible or headless |
| `AUTO_DELETE_LOGS` | Optional | `True`, `False` | Automatically deletes old logs |

Note: See .env.example for more details and recommendations.

Base URL: The base URL is set in the BASE_URL constant in utils/config.py. To change the target site, edit the value of BASE_URL in that file.
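
As an illustration of how these variables might be read and validated at startup, here is a minimal sketch. The `Settings` dataclass and `load_settings` helper are hypothetical names, not part of this repository, and the defaults used when the optional variables are unset (`auto` and `False`) are assumptions. Check utils/config.py and .env.example for the actual behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

ALLOWED_STAGES = {"production", "testing", "staging"}


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables documented above."""
    stage: str
    valid_token: str
    headless_mode: Optional[bool]  # None stands for "auto"
    auto_delete_logs: bool


def _parse_bool(raw: str) -> bool:
    """Accept 'True' / 'False' in any letter case."""
    value = raw.strip().lower()
    if value == "true":
        return True
    if value == "false":
        return False
    raise ValueError(f"expected True or False, got {raw!r}")


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    # STAGE and VALID_TOKEN are required; fail early if either is missing.
    missing = [name for name in ("STAGE", "VALID_TOKEN") if not env.get(name)]
    if missing:
        raise RuntimeError("missing required variables: " + ", ".join(missing))

    stage = env["STAGE"]
    if stage not in ALLOWED_STAGES:
        raise ValueError(f"STAGE must be one of {sorted(ALLOWED_STAGES)}, got {stage!r}")

    # Assumed defaults for the optional variables: "auto" and "False".
    headless_raw = env.get("HEADLESS_MODE", "auto")
    headless = None if headless_raw.strip().lower() == "auto" else _parse_bool(headless_raw)
    auto_delete = _parse_bool(env.get("AUTO_DELETE_LOGS", "False"))

    return Settings(stage, env["VALID_TOKEN"], headless, auto_delete)
```

Passing a plain dictionary instead of `os.environ` (for example `load_settings({"STAGE": "testing", "VALID_TOKEN": "sample"})`) makes the validation easy to exercise in tests.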

🏁 Quick Start

1. Clone the repository

git clone https://github.com/Ismola/selenium-scraper-quickstarter.git
cd selenium-scraper-quickstarter

2. Set up your environment

Copy .env.example to .env and edit it as needed.
Make sure you have Python 3.x and Google Chrome installed.
Set the base URL: Edit the BASE_URL constant in utils/config.py to point to your target website.

3. Choose your development mode

Option A: Dev Container (Recommended)

Install VS Code and the Dev Containers extension.
Install Docker.
Open the project in VS Code and select "Reopen in Container".

Option B: GitHub Codespaces

Click "Code" > "Open with Codespaces" on GitHub.
Wait for the environment to be set up automatically.

Option C: Manual

python3 -m venv venv
source venv/bin/activate  # or .\venv\Scripts\activate on Windows
pip install -r requirements.txt

▶️ Running

Local

python3 main.py

Docker

docker build -t selenium-scraper .
docker run --env-file .env -p 3000:3000 selenium-scraper

Production (Gunicorn)

gunicorn -w 2 -b 0.0.0.0:3000 --timeout 600 main:app

⚙️ CI/CD and Workflow Customization

Change environment variables for CI/CD:
- Go to Settings > Secrets and variables > Actions in your GitHub repository.
- Add or update secrets like STAGE, VALID_TOKEN as needed.
- These will be injected into the Docker image during the build and publish process.

⚠️ Important:
If you publish the Docker image to a public registry, any environment variable (such as STAGE, VALID_TOKEN) injected during build may be visible to anyone who downloads the image.
Never use production secrets or sensitive tokens in public images.
For private deployments, always use private registries and restrict access to your images.

Change Docker registry or image name:
- Edit the REGISTRY and IMAGE_NAME variables in .github/workflows/docker-publish.yml.

Trigger the publish workflow:
- By default, the Docker image is published only after a successful run of the Test workflow.
- You can change the trigger to run on other branches or events by editing the on: section.

Tip: If you change the exposed port in compose.yaml, adjust the port wherever else it is referenced (for example, in the workflows) so they stay in sync.

🔗 API Usage

Authentication

All protected routes require the header:

Authorization: Bearer <VALID_TOKEN>
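
A minimal sketch of how such a header could be checked on the server side. The `is_authorized` function is a hypothetical helper, not the repository's actual implementation; it only illustrates parsing the `Bearer` scheme and comparing the token in constant time.

```python
import hmac
from typing import Optional


def is_authorized(authorization_header: Optional[str], valid_token: str) -> bool:
    """Return True only for a header of the form 'Bearer <valid_token>'."""
    if not authorization_header or not valid_token:
        return False

    # Split "Bearer abc123" into the scheme and the token.
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False

    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(token.strip().encode(), valid_token.encode())
```

In a Flask view, the header value would typically come from `request.headers.get("Authorization")`, and a failed check would usually produce a 401 response.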

Default Endpoints

| Method | Route | Description |
| --- | --- | --- |
| GET | `/` | Server health check |
| GET | `/sample` | Example endpoint (modifiable) |

Example with `curl`

curl -H "Authorization: Bearer sample" http://localhost:3000/sample
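
The same request can be made from Python using only the standard library. This is a sketch under the same assumptions as the `curl` example (server on `localhost:3000`, token `sample`); `build_sample_request` and `fetch_sample` are hypothetical helper names.

```python
import urllib.request


def build_sample_request(token: str, base_url: str = "http://localhost:3000") -> urllib.request.Request:
    """Build a GET request for /sample carrying the bearer token."""
    return urllib.request.Request(
        f"{base_url}/sample",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def fetch_sample(token: str, base_url: str = "http://localhost:3000") -> str:
    """Send the request and return the response body as text."""
    request = build_sample_request(token, base_url)
    # A generous timeout, since scraping endpoints can take a while to answer.
    with urllib.request.urlopen(request, timeout=600) as response:
        return response.read().decode("utf-8")
```

With the server running, `print(fetch_sample("sample"))` would print whatever the `/sample` endpoint returns.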

🛠️ Customization & Extension

Add your token in .env.
Set the base URL in utils/config.py by editing the BASE_URL constant.
Create new endpoints in main.py.
Implement scraping logic in actions/ and controllers in controller/.
Use utilities from utils/ for logging, configuration, and helpers.

🧩 Architecture & Flow

main.py: Defines endpoints and starts Flask.
controller/: Receives the request, validates, and calls the action.
actions/: Executes scraping logic (Selenium).
utils/: Configuration, helpers, and shared utilities.
- BASE_URL is defined in utils/config.py.
temp_downloads/: Stores temporarily downloaded files.
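
The flow above (endpoint, then controller, then action) can be sketched in plain Python. All names here (`sample_action`, `sample_controller`, the `get_page_title` callable, and the placeholder `BASE_URL` value) are hypothetical and only illustrate the layering; a callable stands in for Selenium so the sketch runs without a browser.

```python
from typing import Any, Callable, Dict, Mapping, Tuple

# utils/config.py: placeholder value, not the repository's real target.
BASE_URL = "https://example.com"


# actions/: scraping logic. A real action would drive Selenium here.
def sample_action(get_page_title: Callable[[str], str], url: str) -> Dict[str, str]:
    return {"url": url, "title": get_page_title(url)}


# controller/: validates the request, calls the action, shapes the response.
def sample_controller(
    params: Mapping[str, str],
    get_page_title: Callable[[str], str],
) -> Tuple[Dict[str, Any], int]:
    path = params.get("path", "/")
    if not path.startswith("/"):
        return {"error": "path must start with '/'"}, 400
    try:
        result = sample_action(get_page_title, BASE_URL + path)
    except Exception as exc:  # report scraping failures as a server error
        return {"error": str(exc)}, 500
    return {"data": result}, 200


# main.py: an endpoint would pass the request parameters to the controller
# and return its (body, status) pair as the HTTP response.
```

Keeping the browser access behind a parameter is one possible design choice: it lets the controller be tested with a fake, for example `sample_controller({"path": "/a"}, lambda url: "Title")`.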

🚢 Deployment

There are several recommended ways to deploy your scraper in a production or staging environment:

1. Gunicorn (Recommended for Production)

The project is ready to be served using Gunicorn, a robust WSGI HTTP server for Python web applications. This is the method used in the provided Dockerfile.

To run with Gunicorn manually:

gunicorn -w 2 -b 0.0.0.0:3000 --timeout 600 main:app

-w 2: Number of worker processes (adjust as needed).
-b 0.0.0.0:3000: Binds to all interfaces on port 3000.
--timeout 600: Increases timeout for long scraping tasks.
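
The same options can also be kept in a Gunicorn configuration file, which is itself a Python module. This is an optional sketch; the file name `gunicorn.conf.py` is a common convention, and the repository may not include such a file.

```python
# gunicorn.conf.py (hypothetical file; mirrors the command-line flags above)

# Number of worker processes (equivalent to -w 2).
workers = 2

# Address and port to bind to (equivalent to -b 0.0.0.0:3000).
bind = "0.0.0.0:3000"

# Seconds a worker may stay silent before being restarted (equivalent to --timeout 600).
timeout = 600
```

It would then be used with `gunicorn -c gunicorn.conf.py main:app`.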

2. Docker (Recommended for Consistency)

You can deploy the application using Docker, ensuring all dependencies and environment settings are consistent across environments.

Build and run the container:

docker build -t selenium-scraper .
docker run --env-file .env -p 3000:3000 selenium-scraper

3. Docker Compose (For Multi-Service and Volume Management)

The repository includes a compose.yaml file for Docker Compose, which simplifies running the application with persistent storage.

To deploy with Docker Compose:

docker compose up --build

Volumes in Compose

./logs:/app/logs: Persists application logs on your host machine for easier debugging and auditing.
./temp_downloads:/app/temp_downloads: Stores downloaded files outside the container, so you don't lose data on container restarts.

Tip: You can customize the exposed ports and volume paths in compose.yaml as needed for your infrastructure.

⚙️ GitHub Actions CI/CD

This project includes preconfigured GitHub Actions workflows for continuous integration and automated Docker image publishing. You can find the workflow files in .github/workflows/.

Included Workflows

Docker Smoke (docker-smoke.yml):
- Builds the Docker image and verifies that the application starts correctly using Docker Compose.
- Performs a smoke test by accessing the root endpoint (/) to ensure the container responds.
CI Test Suite (ci-test.yml):
- Runs after the smoke test.
- Installs dependencies and runs automated tests with pytest inside the Docker Compose environment.
- Ensures the application passes tests before continuing the pipeline.
Docker Build & Publish (docker-publish.yml):
- Runs only if the previous workflows are successful.
- Builds and publishes the Docker image to GitHub Container Registry (ghcr.io).
- Uses environment variables and secrets configured in the repository.
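
To reproduce the idea of the smoke test locally, a small polling helper can wait until the root endpoint answers. This is a sketch of the general technique, not the contents of docker-smoke.yml; `wait_until_healthy` and its parameters are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def wait_until_healthy(
    url: str,
    attempts: int = 10,
    delay: float = 1.0,
    opener: Callable = urllib.request.urlopen,
) -> bool:
    """Poll `url` until it returns HTTP 200 or the attempts run out."""
    for attempt in range(attempts):
        try:
            with opener(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            # The container may still be starting; try again after a pause.
            pass
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

For example, after `docker compose up --build -d`, calling `wait_until_healthy("http://localhost:3000/")` would report whether the container started responding within the allotted attempts.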

⚙️ You can customize or extend these workflows by editing the files in .github/workflows/ as needed for your CI/CD requirements.

🧪 Automated Testing

The project includes automated tests located in the test/ folder, using pytest and Flask's test client.

📂 Test Structure

test/test_main.py: Tests for the main endpoints defined in main.py.

▶️ How to run the tests

Install dependencies if you haven't already:

pip install -r requirements.txt
pip install pytest

Run the tests:

pytest --maxfail=1 --disable-warnings -v

💡 Tip: You can also run the tests automatically in the CI/CD flow with GitHub Actions.

✏️ How to modify or add tests?

Edit or add files in the test/ folder following the example in test_main.py.
Use the Flask client to simulate HTTP requests and validate responses.
To test new endpoints, create functions starting with test_ and use the client fixture.
See the pytest documentation for more options and best practices.
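
A minimal sketch of what such a test file might look like. The stand-in app and its `{"status": "ok"}` response are assumptions made so the example is self-contained; in this repository the tests would import the real application instead (for example `from main import app`) and assert on its actual responses.

```python
import pytest
from flask import Flask, jsonify

# Stand-in application; replace with the real one from main.py.
app = Flask(__name__)


@app.route("/")
def health():
    return jsonify(status="ok")


@pytest.fixture
def client():
    """Provide a Flask test client to each test function."""
    app.config.update(TESTING=True)
    with app.test_client() as test_client:
        yield test_client


def test_health_check(client):
    response = client.get("/")
    assert response.status_code == 200
    assert response.get_json() == {"status": "ok"}


def test_unknown_route_returns_404(client):
    response = client.get("/does-not-exist")
    assert response.status_code == 404
```

pytest collects every function whose name starts with `test_` and injects the `client` fixture by matching the parameter name.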

🤖 Custom instructions for GitHub Copilot

This project uses custom Copilot instructions from Ismola/personal-copilot-instructions.
Each time the devcontainer starts, they are cloned and updated automatically in .github/instructions.

🐞 Troubleshooting

Common error: `local variable 'driver' referenced before assignment`

This may be due to incompatibility between Chrome and Chromedriver.
Quick fix:
1. Delete the drivers folder: rm -rf ~/.wdm
2. Restart the environment.
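
One common way this message arises is when browser creation fails inside a `try` block and cleanup code then touches the variable that was never assigned. The sketch below reproduces that pattern without Selenium; `create_driver` is a hypothetical stand-in for the code that launches Chrome, and this is not necessarily how the repository's code is written. In Python 3.11 and later the same condition is reported as `cannot access local variable 'driver' where it is not associated with a value`.

```python
def create_driver():
    """Hypothetical factory that fails, as a version mismatch would."""
    raise RuntimeError("Chrome and Chromedriver versions do not match")


def run_action_unsafe():
    try:
        driver = create_driver()  # raises before `driver` is bound
        return "ok"
    finally:
        # `driver` was never assigned, so this raises UnboundLocalError
        # and hides the original RuntimeError.
        driver.quit()


def run_action_safe():
    driver = None  # bind the name before anything can fail
    try:
        driver = create_driver()
        return "ok"
    except RuntimeError as exc:
        return f"failed: {exc}"
    finally:
        if driver is not None:
            driver.quit()
```

Initializing `driver = None` first means the underlying cause (here, the version mismatch) surfaces in the logs instead of being masked by the secondary error.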

Other issues

Ensure environment variables are correctly set.
Check generated logs for more details.

📚 Resources & Bibliography

🤝 Contributions

Pull requests and suggestions are welcome! Please open an issue to discuss major changes.


License
