Norconex / crawlers
Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
☆194 Updated this week
Alternatives and similar repositories for crawlers
Users who are interested in crawlers are comparing it to the libraries listed below.
- A set of reusable Java components that implement functionality common to any web crawler ☆245 Updated this week
- A scalable, mature and versatile web crawler based on Apache Storm ☆928 Updated last week
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆218 Updated 2 years ago
- Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark. ☆417 Updated 2 years ago
- Distributed processing framework for search solutions ☆82 Updated 2 years ago
- The Apache Gora open source framework provides an in-memory data model and persistence for big data. ☆122 Updated last year
- Mirror of Apache ManifoldCF ☆80 Updated last month
- Browser-driven explorer for Lucene indexes ☆74 Updated 3 years ago
- High-security graph database ☆65 Updated 3 years ago
- The LAW next generation crawler. ☆88 Updated 3 years ago
- The Sweble Wikitext Components module provides a parser for MediaWiki's wikitext and an engine trying to emulate the behavior of a MediaW… ☆73 Updated last year
- Fast in-memory graph structure, powering Gephi ☆75 Updated last week
- Carrot2: Text Clustering Algorithms and Applications ☆818 Updated last week
- Apache OpenNLP Sandbox ☆43 Updated last week
- Open Source, Distributed, Big Data Enterprise Search Engine ☆85 Updated this week
- Solr Query Segmenter for structuring unstructured queries ☆22 Updated 4 years ago
- Solr Redis Extensions ☆53 Updated last year
- A curated list of Awesome Apache Solr links and resources. ☆109 Updated 3 years ago
- Zulia Search Engine ☆33 Updated 3 weeks ago
- Crawl-Anywhere - Web Crawler and document processing pipeline with Solr integration. ☆97 Updated 8 years ago
- Common web archive utility code. ☆56 Updated last month
- Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, what… ☆33 Updated last month
- Angular JS Solr and Elasticsearch and OpenSearch Diagnostic Search Services ☆27 Updated last month
- Java text categorization system ☆57 Updated 8 years ago
- A text tagger based on Lucene / Solr, using FST technology ☆177 Updated last year
- Suite of tools for detecting changes in web pages and their rendering ☆55 Updated last year
- Apache NiFi Custom Processor Extracting Text From Files with Apache Tika ☆35 Updated 2 years ago
- Behemoth is an open source platform for large scale document analysis based on Apache Hadoop. ☆283 Updated 7 years ago
- Extra pluggable modules for Apache MetaModel (but licensed with LGPL) ☆17 Updated 3 years ago
- Demonstration of searching PDF document with Solr, Tika, and Tesseract ☆31 Updated 10 months ago