A web scraper built to scrape the websites of hackathon and internship organisers. It will be integrated in the future with Hack-a-Bot (a Discord bot that gives timely updates on such opportunities).

JyotirmoyDas05/Hack-a-Bot-scraper

Hack-a-Bot-Scraper 🤖

A comprehensive web scraping tool designed to automatically collect hackathon information from multiple popular hackathon platforms. This tool helps developers, students, and hackathon enthusiasts stay updated with the latest hackathon opportunities by scraping and storing event details in a MongoDB database.

🎯 Features

  • Multi-Platform Support: Scrapes hackathons from multiple popular platforms:

    • Devpost: One of the largest hackathon hosting platforms
    • AllHackathons: Comprehensive hackathon listing website
    • Hack2Skill: Platform for skill-based hackathons and competitions
  • Intelligent Data Extraction: Extracts comprehensive hackathon details including:

    • Event name and description
    • Start and end dates
    • Registration deadlines
    • Event mode (Online/Offline/Hybrid)
    • Location information
    • Prize amounts and participant counts
    • Event images and URLs
    • Tags and categories
    • Timeline information
  • Database Integration: Stores all scraped data in MongoDB with duplicate prevention

  • Anti-Bot Protection: Uses undetected Chrome driver to bypass anti-bot measures

  • Error Handling: Robust error handling and logging for reliable operation

  • Modular Design: Each scraper is independently developed and can be run separately

🏗️ Project Structure

Hack-a-Bot-scraper/
├── LICENSE
├── README.md
└── Scraper/
    ├── __init__.py
    ├── __main__.py          # Main entry point
    ├── db.py                # Database operations
    ├── Procfile             # Heroku deployment configuration
    ├── requirements.txt     # Python dependencies
    └── web/                 # Individual scrapers
        ├── __init__.py
        ├── allhackathon_scraper.py    # AllHackathons.com scraper
        ├── devpost_scraper.py         # Devpost.com scraper
        └── Hack2skill_scraper.py      # Hack2Skill.com scraper

🚀 Installation

Prerequisites

  • Python 3.7+
  • Chrome/Chromium browser installed
  • MongoDB database (local or cloud)

Setup

  1. Clone the repository:

    git clone https://github.com/JyotirmoyDas05/Hack-a-Bot-scraper.git
    cd Hack-a-Bot-scraper
  2. Install dependencies:

    pip install -r Scraper/requirements.txt
  3. Install Chrome and ChromeDriver:

    • Install Google Chrome:
      Download and install the latest version of Google Chrome from the official website.

    • Download ChromeDriver:
      Download the ChromeDriver version that matches your installed Chrome browser from the ChromeDriver Downloads page.

    • Extract and Set Up ChromeDriver:

      1. Extract the downloaded chromedriver.exe to a folder, e.g., C:\chromedriver.
      2. Add the folder path to your Windows PATH environment variable:
        • Press Win + S, search for "Environment Variables", and open "Edit the system environment variables".
        • Click "Environment Variables".
        • Under "System variables", find and select the Path variable, then click "Edit".
        • Click "New" and enter the path to your ChromeDriver folder (e.g., C:\chromedriver).
        • Click "OK" to save and close all dialogs.
    • Verify Installation:

      • Open a new Command Prompt and run:
        chromedriver --version
      • You should see the installed ChromeDriver version displayed.

    Note: Ensure that the ChromeDriver version matches your installed Chrome browser version for compatibility.

  4. Environment Configuration: Create a .env file in the root directory:

    MONGO_URI=mongodb://localhost:27017/HackBot
    # Or for MongoDB Atlas:
    # MONGO_URI=mongodb+srv://username:password@cluster.mongodb.net/HackBot
  5. Database Setup:

    • Ensure MongoDB is running (local installation) or configure MongoDB Atlas
    • The application will automatically create the required database and collections
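
As a minimal sketch of how the MONGO_URI setting might be consumed: the actual db.py is not shown here, so the helper names `get_mongo_uri` and `database_name` are hypothetical. The snippet reads the variable from the environment, falls back to the local default from the example above, and extracts the database name from the URI path. It uses only the standard library; in the real project, python-dotenv would typically load the .env file into the environment first.

```python
import os
from urllib.parse import urlparse

# Local default taken from the .env example above.
DEFAULT_URI = "mongodb://localhost:27017/HackBot"


def get_mongo_uri(env=None):
    """Return MONGO_URI from the given mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return env.get("MONGO_URI") or DEFAULT_URI


def database_name(uri):
    """Extract the database name from the path component of a MongoDB URI."""
    name = urlparse(uri).path.lstrip("/")  # "/HackBot" -> "HackBot"
    if not name:
        raise ValueError("MONGO_URI does not name a database")
    return name
```

Passing the environment as a mapping keeps the helper easy to test without touching real environment variables.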

🎮 Usage

Running All Scrapers

cd Scraper
python __main__.py

Running Individual Scrapers

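You can also run specific scrapers independently. The imports below assume Python is started from the repository root, so that the Scraper package is importable: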

# Run Devpost scraper only
from Scraper.web import devpost_scraper
devpost_scraper.main()

# Run AllHackathons scraper only  
from Scraper.web import allhackathon_scraper
scraper = allhackathon_scraper.HackClub()
scraper.scrape_hackathons()

# Run Hack2Skill scraper only
from Scraper.web import Hack2skill_scraper
Hack2skill_scraper.main()
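
Because each scraper exposes its own entry point, a small runner can call them in turn and keep going when one fails. This is a sketch only: the actual __main__.py is not shown here, and `run_all` is a hypothetical name. The callables passed in would be the entry points listed above.

```python
import logging

logger = logging.getLogger("scraper_runner")


def run_all(scrapers):
    """Run each (name, callable) pair in order; return the names that failed."""
    failed = []
    for name, run in scrapers:
        try:
            run()
            logger.info("%s finished", name)
        except Exception:
            # Log the traceback and continue with the next scraper.
            logger.exception("%s failed", name)
            failed.append(name)
    return failed
```

For example, `run_all([("devpost", devpost_scraper.main), ("hack2skill", Hack2skill_scraper.main)])` would return an empty list when both scrapers complete.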

📊 Data Schema

Each hackathon entry stored in the database contains:

{
    'name': str,           # Hackathon name
    'url': str,            # Event URL
    'image': str,          # Event image URL
    'start': str,          # Start date
    'end': str,            # End date
    'mode': str,           # Online/Offline/Hybrid
    'location': str,       # Event location
    'website': str,        # Source website (DEVPOST/ALLHACKATHONS/HACK2SKILL)
    'new': bool,           # Flag for new entries
    'prize_amount': str,   # Prize information (optional)
    'participants': str,   # Participant count (optional)
    'tags': list,          # Event tags/categories (optional)
    'timeline': str,       # Event timeline (optional)
    'deadline': str        # Registration deadline (optional)
}
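
To make the split between required and optional keys concrete, here is a minimal sketch of a helper that builds a record in this shape. The function name `make_entry`, the strict rejection of unknown keys, and setting `new` to True for freshly built records are assumptions for illustration, not the project's actual code.

```python
REQUIRED = ("name", "url", "image", "start", "end", "mode", "location", "website")
OPTIONAL = ("prize_amount", "participants", "tags", "timeline", "deadline")


def make_entry(**fields):
    """Build a hackathon record, rejecting missing or unknown keys."""
    missing = [key for key in REQUIRED if key not in fields]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    unknown = set(fields) - set(REQUIRED) - set(OPTIONAL)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    entry = {key: fields[key] for key in REQUIRED}
    entry["new"] = True  # assumption: freshly scraped entries start flagged as new
    # Optional keys are copied only when the scraper supplied them.
    entry.update({key: fields[key] for key in OPTIONAL if key in fields})
    return entry
```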

🛠️ Dependencies

  • selenium: Web automation and scraping
  • undetected-chromedriver: Anti-detection Chrome driver
  • pymongo: MongoDB database operations
  • python-dotenv: Environment variable management
  • dnspython: DNS resolution for MongoDB
  • urllib3: HTTP library for web requests

🐳 Deployment

The project includes a Procfile for easy deployment on platforms like Heroku:

# Deploy to Heroku
heroku create your-app-name
heroku config:set MONGO_URI=your_mongodb_connection_string
git push heroku main

🔧 Configuration

Chrome Driver Options

The scrapers use various Chrome options for optimal performance:

  • --no-sandbox: Bypass OS security model
  • --disable-dev-shm-usage: Overcome limited resource problems
  • --disable-gpu: Disable GPU hardware acceleration
  • --window-size=1920,1080: Set browser window size
  • --disable-blink-features=AutomationControlled: Hide automation indicators
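
A minimal sketch of how these flags could be applied. The helper names are hypothetical and the scrapers' real setup code is not shown here. `build_driver` assumes undetected-chromedriver is installed and imports it lazily, so the flag list can be inspected without a browser.

```python
CHROME_FLAGS = [
    "--no-sandbox",
    "--disable-dev-shm-usage",
    "--disable-gpu",
    "--window-size=1920,1080",
    "--disable-blink-features=AutomationControlled",
]


def chrome_flags():
    """Return a copy of the flag list so callers can extend it safely."""
    return list(CHROME_FLAGS)


def build_driver():
    """Create an undetected Chrome driver with the flags applied."""
    # Imported here so this module loads even without the package installed.
    import undetected_chromedriver as uc

    options = uc.ChromeOptions()
    for flag in chrome_flags():
        options.add_argument(flag)
    return uc.Chrome(options=options)
```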

Database Configuration

The database module (db.py) provides functions for:

  • save_hackathons(): Insert new hackathon data
  • get_hackathons(): Retrieve existing hackathons by website
  • delete_hackathons(): Remove specific hackathon entries
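
The signatures of these functions are not shown here, so the following is only a sketch of one way duplicate prevention could work on top of them: compare freshly scraped entries against those already stored for the same website and keep only unseen URLs. Keying on the `url` field and the name `filter_new` are assumptions.

```python
def filter_new(scraped, existing):
    """Return scraped entries whose URL is not already stored.

    `existing` stands in for the result of a lookup such as
    get_hackathons(website); its real return type is an assumption.
    """
    seen = {entry["url"] for entry in existing}
    fresh = []
    for entry in scraped:
        if entry["url"] in seen:
            continue  # already stored, or repeated within this batch
        seen.add(entry["url"])
        fresh.append(entry)
    return fresh
```

Only the entries returned by `filter_new` would then be passed on for insertion.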

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Adding New Scrapers

To add a new hackathon platform scraper:

  1. Create a new file in the Scraper/web/ directory
  2. Implement the scraping logic following the existing patterns
  3. Add database integration using the provided db.py functions
  4. Update __main__.py to include the new scraper
  5. Update this README with the new platform information
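
As a rough illustration of steps 1 to 3, a new scraper module might separate page parsing from storage so that the parsing can be tested without a browser or database. Everything here is hypothetical: the source label `EXAMPLEPLATFORM`, the raw card keys (`title`, `link`, and so on), and the `save` callable, which stands in for whatever db.py function the project actually uses.

```python
WEBSITE = "EXAMPLEPLATFORM"  # hypothetical source label


def parse_card(card):
    """Map one raw listing (a dict of scraped strings) to the data schema."""
    return {
        "name": card["title"].strip(),
        "url": card["link"],
        "image": card.get("image", ""),
        "start": card.get("start", ""),
        "end": card.get("end", ""),
        "mode": card.get("mode", ""),
        "location": card.get("location", ""),
        "website": WEBSITE,
        "new": True,
    }


def main(cards, save):
    """Parse every card, skip malformed ones, and hand the rest to `save`."""
    entries = []
    for card in cards:
        try:
            entries.append(parse_card(card))
        except KeyError:
            continue  # a card without a title or link is skipped
    save(entries)
    return len(entries)
```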

📝 License

This project is licensed under the MIT License; the terms are specified in the LICENSE file.

⚠️ Disclaimer

This tool is designed for educational and personal use. Please ensure you comply with the terms of service of the websites being scraped. The developers are not responsible for any misuse of this tool.

🐛 Known Issues & Troubleshooting

  • Chrome Driver Issues: Ensure Chrome/Chromium is installed and up to date
  • Database Connection: Verify MongoDB is running and connection string is correct
  • Rate Limiting: The scrapers include delays to respect website rate limits
  • Element Not Found: Websites may change their structure over time, so the scrapers' selectors may need updating
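
For the rate-limiting and element-not-found points, a small retry helper with a randomized pause is one common approach. This is a sketch, not the scrapers' actual logic; the delay bounds, the attempt count, and the name `polite_call` are assumptions.

```python
import random
import time


def polite_call(action, attempts=3, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Call `action`, pausing a random interval before each attempt.

    Returns the first successful result; re-raises the last exception
    if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        sleep(random.uniform(min_delay, max_delay))
        try:
            return action()
        except Exception as error:  # e.g. a missing element after a layout change
            last_error = error
    raise last_error
```

The `sleep` parameter exists so the pause can be replaced in tests; normal callers would leave it at its default.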

📈 Future Enhancements

  • Add more hackathon platforms (MLH, HackerEarth, etc.)
  • Implement scheduling for automatic periodic scraping
  • Add email notifications for new hackathons
  • Create a web dashboard for viewing scraped data
  • Implement data filtering and search capabilities
  • Add export functionality (CSV, JSON)

📞 Support

For questions, issues, or contributions, please open an issue on GitHub or connect with the developer on LinkedIn via their profile page.
