Welcome to EasyCrawler, the web crawler that turns web exploration into an exciting adventure! 🕷️💻
Imagine an intelligent robot that navigates the internet like a digital detective, collecting valuable information from websites with ease and efficiency. EasyCrawler is built with modern Java, using the powerful Crawler4j library, and offers a flexible and extensible architecture to meet your crawling needs.
- 🔍 Smart Search: Implement custom search strategies with our `SearchStrategy` interface
- 🌐 Controlled Navigation: Configure crawling depth, page limits, and respect for robots.txt
- 📁 Flexible Persistence: Save results however you want - file, database, or any destination
- ⚡ High Performance: Multithreading for fast and efficient crawling
- 🛡️ Server Respect: Politeness control to avoid overloading sites
- 🎯 Advanced Filtering: Ignore binary files and focus on relevant content
- Java 21 or higher
- Maven 3.6+
Add the following dependency to your `pom.xml`:

```xml
<dependency>
    <groupId>io.github.gabrielmesquitaoliveira</groupId>
    <artifactId>easycrawler</artifactId>
    <version>1.0.0</version>
</dependency>
```
EasyCrawler is available on Maven Central, so no additional repository configuration is needed.
```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import io.github.EasyCrawlerConfig;
import io.github.EasyCrawlerFactory;
import io.github.SearchStrategy;
import io.github.strategies.TextSearchStrategy;
import org.jsoup.nodes.Element;

import java.util.List;
import java.util.function.BiFunction;

public class QuickStart {
    public static void main(String[] args) throws Exception {
        // 1. Configure Crawler4j
        CrawlConfig crawlConfig = new CrawlConfig();
        crawlConfig.setCrawlStorageFolder("/tmp/crawler4j/");
        crawlConfig.setPolitenessDelay(1000); // 1 second delay between requests
        crawlConfig.setMaxDepthOfCrawling(2);
        crawlConfig.setMaxPagesToFetch(100);
        crawlConfig.setIncludeBinaryContentInCrawling(false);

        // 2. Set up the Crawler4j controller
        PageFetcher pageFetcher = new PageFetcher(crawlConfig);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(crawlConfig, pageFetcher, robotstxtServer);

        // 3. Add seed URLs
        controller.addSeed("https://example.com");

        // 4. Configure your search strategy
        SearchStrategy strategy = new TextSearchStrategy("keyword");

        // 5. Configure the persistence function
        BiFunction<List<Element>, String, Void> persistFunc = (elements, url) -> {
            // Save the found elements
            System.out.println("Found at: " + url);
            elements.forEach(el -> System.out.println(el.text()));
            return null;
        };

        // 6. Create the configuration
        EasyCrawlerConfig config = new EasyCrawlerConfig(
            "https://example.com",
            true,       // Restrict to domain
            List.of(),  // Allow all URIs
            List.of(),  // Deny none
            strategy,
            persistFunc
        );

        // 7. Start crawling
        controller.start(new EasyCrawlerFactory(config), 8); // 8 threads
    }
}
```
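The persistence function is just a `BiFunction`, so results can go anywhere. As one illustration, here is a minimal sketch that appends each match to a local file instead of printing it; the file name, TSV format, and `FilePersistence` class are inventions for this sketch, not anything the library prescribes:

```java
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.function.BiFunction;

public class FilePersistence {
    // Drop-in replacement for step 5 above: appends each match as "url<TAB>text".
    public static final BiFunction<List<Element>, String, Void> PERSIST_TO_FILE =
        (elements, url) -> {
            Path out = Path.of("results.tsv"); // arbitrary output path for this example
            for (Element el : elements) {
                try {
                    Files.writeString(out, url + "\t" + el.text() + System.lineSeparator(),
                            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
                } catch (IOException e) {
                    throw new UncheckedIOException(e); // fail loudly rather than drop data
                }
            }
            return null;
        };
}
```

Because the controller runs several crawler threads, the persistence function may be invoked concurrently, so a production version should synchronize writes or hand results to a thread-safe sink.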
EasyCrawler follows the Strategy design pattern for maximum flexibility:
- `EasyCrawler`: The heart of the crawler; extends `WebCrawler` from Crawler4j
- `SearchStrategy`: Interface for implementing different types of search
- `EasyCrawlerConfig`: Immutable configuration built as a Java record (sketched below)
- `EasyCrawlerFactory`: Factory that creates crawler instances
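Because `EasyCrawlerConfig` is a record, its components line up one-to-one with the six constructor arguments used in the quick start, which is what makes the configuration immutable by construction. The following is only an inferred sketch (the component names are assumptions, not the library's actual source, and `SearchStrategy` is assumed to be in scope):

```java
import org.jsoup.nodes.Element;

import java.util.List;
import java.util.function.BiFunction;

// Hypothetical sketch inferred from the six-argument constructor call in the
// quick start; component names are assumptions, not the library's actual source.
public record EasyCrawlerConfig(
    String seedUrl,                                          // starting point of the crawl
    boolean restrictToDomain,                                // stay on the seed's domain
    List<String> allowedUris,                                // empty list = allow all
    List<String> deniedUris,                                 // empty list = deny none
    SearchStrategy strategy,                                 // what to look for on each page
    BiFunction<List<Element>, String, Void> persistFunction  // what to do with matches
) {}
```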
- TextSearchStrategy: Search for exact text in HTML elements
- AttributeSearchStrategy: Search for elements with specific attribute values
- CssSelectorSearchStrategy: Search using CSS selectors
- RegexTextSearchStrategy: Search for text matching a regular expression
- PriceRangeSearchStrategy: Search for price values within a specified range
- XPathSearchStrategy: Search for elements using XPath expressions
- ClassSearchStrategy: Search for elements with a specific class name
- TagSearchStrategy: Search for elements by their tag name
- (Add your own strategies by implementing `SearchStrategy`; see the sketch below)
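The `SearchStrategy` contract is what makes extension easy. The exact interface isn't shown here, but based on how the built-in strategies are used above (a JSoup-backed search that yields matching elements), a custom strategy might look roughly like this sketch; the `search(Document)` method name and signature are assumptions, so check the `SearchStrategy` interface in the library for the actual contract:

```java
import io.github.SearchStrategy;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.List;

// Sketch of a custom strategy. The search(Document) method name and signature
// are assumptions based on how the built-in strategies are used above.
public class LongTextSearchStrategy implements SearchStrategy {
    private final int minLength;

    public LongTextSearchStrategy(int minLength) {
        this.minLength = minLength;
    }

    @Override
    public List<Element> search(Document document) {
        // Match paragraphs whose visible text is at least minLength characters,
        // a filter none of the built-in strategies cover directly.
        return document.select("p").stream()
            .filter(p -> p.text().length() >= minLength)
            .toList();
    }
}
```

Once defined, it drops into the quick start exactly like `TextSearchStrategy`: `SearchStrategy strategy = new LongTextSearchStrategy(200);`.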
We love contributions! 🚀
- Fork the project
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
- 🆕 New search strategies beyond the built-in set
- 🌍 Support for more data formats (JSON, XML)
- 🔄 Data export options (CSV, JSON, etc.)
- 📦 Plugin system for custom functionality
- 📊 Database integration
- 🌐 Proxy and authentication support
- 📈 Crawling metrics and dashboards
- 🧪 Automated tests
- Crawler4j 4.2: Robust crawling engine
- JSoup 1.19.1: Powerful HTML parser
- Lombok 1.18.38: Reduces boilerplate code
This project is under the MIT license - see the LICENSE file for details.
Ready to explore the web like never before? EasyCrawler is here to make your data journey an amazing experience! 🌍✨
Made with ❤️
⭐ If this project helped you, give it a star on GitHub!