Norconex / crawlers
Norconex Crawlers (or spiders) are flexible web and filesystem crawlers that collect, parse, and manipulate data from the web or a filesystem and store it in various data repositories, such as search engines.
☆194 · Updated last week
Alternatives and similar repositories for crawlers
Users who are interested in crawlers are comparing it to the libraries listed below.
- A set of reusable Java components that implement functionality common to any web crawler ☆248 · Updated last week
- Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark. ☆418 · Updated 2 years ago
- A scalable, mature and versatile web crawler based on Apache Storm ☆933 · Updated this week
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆220 · Updated 2 years ago
- Carrot2: Text Clustering Algorithms and Applications ☆830 · Updated last week
- Java library for parsing semi-structured text files ☆65 · Updated 3 years ago
- Browser-driven explorer for lucene indexes ☆74 · Updated 4 years ago
- The LAW next generation crawler. ☆89 · Updated 3 years ago
- Java library for reading and writing WARC files with a typed API ☆50 · Updated 2 weeks ago
- A curated list of Awesome Apache Solr links and resources. ☆110 · Updated 4 years ago
- The Apache Gora open source framework provides an in-memory data model and persistence for big data. ☆122 · Updated last year
- Apache OpenNLP Sandbox ☆44 · Updated this week
- High-security graph database ☆64 · Updated 3 years ago
- Distributed processing framework for search solutions ☆82 · Updated 2 years ago
- Solr AutoComplete implementation ☆59 · Updated 7 years ago
- Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, what… ☆33 · Updated 2 months ago
- Solr Query Segmenter for structuring unstructured queries ☆22 · Updated 4 years ago
- Crawl-Anywhere - Web Crawler and document processing pipeline with Solr integration. ☆97 · Updated 8 years ago
- Fast in-memory graph structure, powering Gephi ☆74 · Updated this week
- Java/JNI bindings to libpostal for fast international street address parsing/normalization ☆127 · Updated 3 months ago
- A text tagger based on Lucene / Solr, using FST technology ☆177 · Updated last year
- Query preprocessor for Java-based search engines (Querqy Core and Lucene implementation) ☆187 · Updated this week
- The Sweble Wikitext Components module provides a parser for MediaWiki's wikitext and an engine trying to emulate the behavior of a MediaW… ☆73 · Updated last year
- Apache Anything To Triples (Any23) is a library, a web service and a command line tool that extracts structured data in RDF format from a… ☆98 · Updated 2 years ago
- The next generation of open source search ☆93 · Updated 8 years ago
- Common web archive utility code. ☆56 · Updated 2 months ago
- API definition, resources and reference implementation of URL Frontiers ☆52 · Updated 2 months ago
- A high performance "thin wrapper" HTTP REST server on top of Apache Lucene ☆145 · Updated last year
- Constellio 8 ☆23 · Updated 4 years ago
- Silk is a port of Kibana 4 project. ☆71 · Updated 9 years ago