Norconex / crawlers
Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystems into various data repositories, such as search engines.
☆186 · Updated this week
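For readers unfamiliar with Norconex, its crawlers are typically driven by an XML configuration file. The sketch below assumes the Norconex HTTP Collector's v2-style format; element names such as `httpcollector`, `startURLs`, and `maxDepth` are taken from that format, and the ids and URL are placeholders, so verify against the documentation for the version you run:

```xml
<!-- Minimal Norconex HTTP Collector configuration sketch (v2-style).
     Collector and crawler ids are arbitrary labels chosen for this example. -->
<httpcollector id="example-collector">
  <crawlers>
    <crawler id="example-crawler">
      <!-- Seed URL(s) to start crawling from; stayOnDomain keeps the
           crawl from following links to other hosts. -->
      <startURLs stayOnDomain="true">
        <url>https://example.com/</url>
      </startURLs>
      <!-- Limit how many link hops deep the crawl goes. -->
      <maxDepth>2</maxDepth>
    </crawler>
  </crawlers>
</httpcollector>
```

A similar configuration-driven approach applies to the Norconex Filesystem Collector listed below, with filesystem paths in place of start URLs.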
Alternatives and similar repositories for crawlers:
Users interested in crawlers are comparing it to the libraries listed below.
- A set of reusable Java components that implement functionality common to any web crawler. ☆240 · Updated last month
- The Common Crawl Crawler Engine and related MapReduce code (2008-2012). ☆212 · Updated 2 years ago
- A scalable, mature, and versatile web crawler based on Apache Storm. ☆898 · Updated this week
- Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, what… ☆33 · Updated 3 months ago
- Norconex Filesystem Collector is a flexible crawler for collecting, parsing, and manipulating data ranging from local hard drives to netw… ☆22 · Updated 4 months ago
- Open-source, enterprise-grade search engine software. ☆503 · Updated 2 years ago
- Suite of tools for detecting changes in web pages and their rendering. ☆54 · Updated last year
- Common web archive utility code. ☆52 · Updated last month
- The Apache Gora open source framework provides an in-memory data model and persistence for big data. ☆120 · Updated 11 months ago
- The LAW next-generation crawler. ☆87 · Updated 3 years ago
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop. ☆56 · Updated 3 years ago
- A port of the arc90labs 'readability' package to Java. ☆72 · Updated 12 years ago
- Solr Query Segmenter for structuring unstructured queries. ☆21 · Updated 3 years ago
- Crawl-Anywhere: web crawler and document-processing pipeline with Solr integration.