
🚀 EasyCrawler - The Easy and Powerful Web Crawler!


🌟 What is EasyCrawler?

Welcome to EasyCrawler, the web crawler that turns web exploration into an exciting adventure! 🕷️💻

Imagine an intelligent robot that navigates the internet like a digital detective, collecting valuable information from websites with ease and efficiency. EasyCrawler is built with modern Java, using the powerful Crawler4j library, and offers a flexible and extensible architecture to meet your crawling needs.

✨ Amazing Features

  • 🔍 Smart Search: Implement custom search strategies with our SearchStrategy interface
  • 🌐 Controlled Navigation: Configure crawling depth, page limits, and respect for robots.txt
  • 📁 Flexible Persistence: Save results however you want - file, database, or any destination
  • ⚡ High Performance: Multithreading for fast and efficient crawling
  • 🛡️ Server Respect: Politeness control to avoid overloading sites
  • 🎯 Advanced Filtering: Ignore binary files and focus on relevant content

🛠️ How to Use

Prerequisites

  • Java 21 or higher
  • Maven 3.6+

Installation

Add the following dependency to your pom.xml:

<dependency>
    <groupId>io.github.gabrielmesquitaoliveira</groupId>
    <artifactId>easycrawler</artifactId>
    <version>1.0.0</version>
</dependency>

EasyCrawler is available on Maven Central, so no additional repository configuration is needed.

Usage Example

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import io.github.EasyCrawlerConfig;
import io.github.EasyCrawlerFactory;
import io.github.SearchStrategy;
import io.github.strategies.TextSearchStrategy;
import org.jsoup.nodes.Element;

import java.util.List;
import java.util.function.BiFunction;

// 1. Configure Crawler4j
CrawlConfig crawlConfig = new CrawlConfig();
crawlConfig.setCrawlStorageFolder("/tmp/crawler4j/");
crawlConfig.setPolitenessDelay(1000); // 1 second delay between requests
crawlConfig.setMaxDepthOfCrawling(2);
crawlConfig.setMaxPagesToFetch(100);
crawlConfig.setIncludeBinaryContentInCrawling(false);

// 2. Setup Crawler4j controller
PageFetcher pageFetcher = new PageFetcher(crawlConfig);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
CrawlController controller = new CrawlController(crawlConfig, pageFetcher, robotstxtServer);

// 3. Add seed URLs
controller.addSeed("https://example.com");

// 4. Configure your search strategy
SearchStrategy strategy = new TextSearchStrategy("keyword");

// 5. Configure the persistence function
BiFunction<List<Element>, String, Void> persistFunc = (elements, url) -> {
    // Save the found elements
    System.out.println("Found at: " + url);
    elements.forEach(el -> System.out.println(el.text()));
    return null;
};

// 6. Create the configuration
EasyCrawlerConfig config = new EasyCrawlerConfig(
    "https://example.com",
    true, // Restrict to domain
    List.of(), // Allow all URIs
    List.of(), // Deny none
    strategy,
    persistFunc
);

// 7. Start crawling
controller.start(new EasyCrawlerFactory(config), 8); // 8 threads
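
The persistence function in step 5 only prints to the console. As a sketch of the "save results however you want" idea, here is a drop-in replacement that appends each match to a local TSV file using only JDK APIs; the BiFunction signature comes from the usage example above, while the file name and line format are illustrative.

import org.jsoup.nodes.Element;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.function.BiFunction;
import java.util.stream.Collectors;

// Hypothetical file sink: writes one "url<TAB>element text" line per match.
BiFunction<List<Element>, String, Void> fileSink = (elements, url) -> {
    Path out = Path.of("matches.tsv"); // illustrative output path
    String lines = elements.stream()
            .map(el -> url + "\t" + el.text())
            .collect(Collectors.joining(System.lineSeparator(), "", System.lineSeparator()));
    try {
        Files.writeString(out, lines, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    } catch (IOException e) {
        throw new UncheckedIOException(e); // lambdas cannot throw checked exceptions
    }
    return null;
};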

🏗️ Architecture

EasyCrawler follows the Strategy design pattern for maximum flexibility:

  • EasyCrawler: The heart of the crawler, extends WebCrawler from Crawler4j
  • SearchStrategy: Interface to implement different types of search
  • EasyCrawlerConfig: Immutable configuration using Java Records
  • EasyCrawlerFactory: Factory to create crawler instances
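
Because this README only shows EasyCrawlerConfig through its constructor, the record below is a hypothetical reconstruction based on the arguments passed in the usage example; the field names are guesses, so check the library source for the real definition.

import io.github.SearchStrategy;
import org.jsoup.nodes.Element;

import java.util.List;
import java.util.function.BiFunction;

// Hypothetical sketch of the configuration record; component names are
// inferred from the usage example, not copied from the library.
public record EasyCrawlerConfig(
        String seedUrl,                                  // root URL to crawl
        boolean restrictToDomain,                        // stay on the seed's domain
        List<String> allowedUris,                        // empty list = allow all
        List<String> deniedUris,                         // empty list = deny none
        SearchStrategy strategy,                         // how each page is searched
        BiFunction<List<Element>, String, Void> persist  // where matches are sent
) {}

A record is a natural fit here: the compiler generates the constructor, accessors, equals, hashCode, and toString, giving the configuration compact, immutable value semantics.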

📊 Included Search Strategies

  • TextSearchStrategy: Search for exact text in HTML elements
  • AttributeSearchStrategy: Search for elements with specific attribute values
  • CssSelectorSearchStrategy: Search using CSS selectors
  • RegexTextSearchStrategy: Search for text matching a regular expression
  • PriceRangeSearchStrategy: Search for price values within a specified range
  • XPathSearchStrategy: Search for elements using XPath expressions
  • ClassSearchStrategy: Search for elements with a specific class name
  • TagSearchStrategy: Search for elements by their tag name
  • (Add your own strategies by implementing SearchStrategy; see the sketch below)
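
A minimal sketch of a custom strategy follows, assuming SearchStrategy declares a single method that takes a parsed JSoup Document and returns the matching elements. That shape fits the built-in TextSearchStrategy and the List<Element> handed to the persistence function, but verify the real interface before relying on it.

import io.github.SearchStrategy;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.List;

// Hypothetical custom strategy: collects links that point at PDF files.
// The search(Document) signature is an assumption; check the interface.
public class PdfLinkSearchStrategy implements SearchStrategy {

    @Override
    public List<Element> search(Document document) {
        // JSoup attribute-suffix selector: <a> tags whose href ends in ".pdf"
        return document.select("a[href$=.pdf]");
    }
}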

🤝 How to Contribute

We love contributions! 🚀

  1. Fork the project
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Contribution Ideas

  • 🆕 New search strategies beyond the built-in set
  • 🌍 Support for more data formats (JSON, XML)
  • 🔄 Data export options (CSV, JSON, etc.)
  • 📦 Plugin system for custom functionality
  • 📊 Database integration
  • 🌐 Proxy and authentication support
  • 📈 Crawling metrics and dashboards
  • 🧪 Automated tests

📋 Dependencies

  • Crawler4j 4.2: Robust crawling engine
  • JSoup 1.19.1: Powerful HTML parser
  • Lombok 1.18.38: Reduces boilerplate code

📄 License

This project is under the MIT license - see the LICENSE file for details.

🎉 Let's Crawl!

Ready to explore the web like never before? EasyCrawler is here to make your data journey an amazing experience! 🌍✨


Made with ❤️

⭐ If this project helped you, give it a star on GitHub!

