Norconex / crawlers
Norconex Crawlers (or spiders) are flexible web and filesystem crawlers that collect, parse, and manipulate data from the web or a filesystem and store it in data repositories such as search engines.
☆195, updated last week
Alternatives and similar repositories for crawlers
Users interested in crawlers are comparing it to the libraries listed below.
- A set of reusable Java components that implement functionality common to any web crawler (☆246, updated last week)
- A scalable, mature and versatile web crawler based on Apache Storm (☆946, updated this week)
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) (☆221, updated 2 years ago)
- Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark. (☆418, updated 2 years ago)
- API definition, resources and reference implementation of URL Frontiers (☆52, updated last week)
- Java library for parsing semi-structured text files (☆65, updated 4 years ago)
- High-security graph database (☆64, updated 3 years ago)
- Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, what… (☆33, updated 3 months ago)
- The Sweble Wikitext Components module provides a parser for MediaWiki's wikitext and an engine trying to emulate the behavior of a MediaW… (☆73, updated last year)
- A curated list of Awesome Apache Solr links and resources. (☆110, updated 4 years ago)
- Browser-driven explorer for lucene indexes (☆74, updated 4 years ago)
- Apache OpenNLP Sandbox (☆44, updated this week)
- Suite of tools for detecting changes in web pages and their rendering (☆55, updated last year)
- TinkerPop3 Graph Structure Implementation for OrientDB (☆94, updated 3 weeks ago)
- Java library for reading and writing WARC files with a typed API (☆50, updated last month)
- Open Source, Distributed, Big Data Enterprise Search Engine (☆86, updated last month)
- Common web archive utility code. (☆56, updated 2 weeks ago)
- Palladian is a Java-based toolkit with functionality for text processing, classification, information extraction, and data retrieval from… (☆40, updated last week)
- Mirror of Apache ManifoldCF (☆80, updated last month)
- Java/JNI bindings to libpostal for fast international street address parsing/normalization (☆130, updated 4 months ago)
- Document Ingestion Framework for Search Systems (☆37, updated last month)
- The Apache Gora open source framework provides an in-memory data model and persistence for big data. (☆121, updated last year)
- Crawl-Anywhere - Web Crawler and document processing pipeline with Solr integration. (☆98, updated 8 years ago)
- The next generation of open source search (☆93, updated 8 years ago)
- UADetector is a library to identify over 190 different desktop and mobile browsers and 130 other User-Agents like feed readers, email cli… (☆248, updated 3 years ago)
- The LAW next generation crawler. (☆88, updated 4 years ago)
- Silk is a port of Kibana 4 project. (☆70, updated 9 years ago)
- Java text categorization system (☆57, updated 8 years ago)
- Studio web tool (☆125, updated 3 weeks ago)
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop (☆56, updated 4 years ago)