Norconex / crawlers
Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
☆197 Updated this week
Alternatives and similar repositories for crawlers
Users who are interested in crawlers are comparing it to the libraries listed below.
- A set of reusable Java components that implement functionality common to any web crawler ☆250 Updated last week
- A scalable, mature and versatile web crawler based on Apache Storm ☆950 Updated last week
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆222 Updated 2 years ago
- Open-source Enterprise Grade Search Engine Software ☆512 Updated 3 years ago
- Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark. ☆419 Updated 2 years ago
- Browser-driven explorer for lucene indexes ☆74 Updated 4 years ago
- Apache OpenNLP Sandbox ☆45 Updated this week
- Document Ingestion Framework for Search Systems ☆37 Updated 2 months ago
- A curated list of Awesome Apache Solr links and resources. ☆110 Updated 4 years ago
- Java library for reading and writing WARC files with a typed API ☆50 Updated 2 months ago
- TinkerPop3 Graph Structure Implementation for OrientDB ☆94 Updated 3 weeks ago
- Fast in-memory graph structure, powering Gephi ☆75 Updated this week
- High-security graph database ☆64 Updated 3 years ago
- The Apache Gora open source framework provides an in-memory data model and persistence for big data. ☆121 Updated last year
- Mirror of Apache ManifoldCF ☆80 Updated 2 months ago
- The Sweble Wikitext Components module provides a parser for MediaWiki's wikitext and an engine trying to emulate the behavior of a MediaW… ☆72 Updated last year
- Zulia Search Engine ☆33 Updated 2 weeks ago
- Carrot2: Text Clustering Algorithms and Applications ☆836 Updated 2 weeks ago
- Query preprocessor for Java-based search engines (Querqy Core and Lucene implementation) ☆189 Updated this week
- An ORM / OGM for the TinkerPop graph stack. ☆139 Updated 3 years ago
- Common web archive utility code. ☆57 Updated this week
- Silk is a port of Kibana 4 project. ☆70 Updated 9 years ago
- API definition, resources and reference implementation of URL Frontiers ☆54 Updated 3 weeks ago
- Distributed processing framework for search solutions ☆82 Updated 2 years ago
- Solr Query Segmenter for structuring unstructured queries ☆22 Updated 4 years ago
- SKOS Support for Apache Lucene and Solr ☆56 Updated 4 years ago
- Open Source, Distributed, Big Data Enterprise Search Engine ☆90 Updated 2 weeks ago
- The LAW next generation crawler. ☆89 Updated 4 years ago
- Crawl-Anywhere - Web Crawler and document processing pipeline with Solr integration. ☆98 Updated 8 years ago
- Java/JNI bindings to libpostal for fast international street address parsing/normalization ☆130 Updated 5 months ago