Norconex / crawlers
Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or a filesystem and storing it in various data repositories such as search engines.
☆183 Updated this week
Related projects
Alternatives and complementary repositories for crawlers
- A set of reusable Java components that implement functionality common to any web crawler ☆237 Updated this week
- The LAW next generation crawler. ☆86 Updated 3 years ago
- Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, what… ☆33 Updated last month
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆212 Updated last year
- Open-source Enterprise Grade Search Engine Software ☆499 Updated 2 years ago
- Additional opennlp mapping type for elasticsearch in order to perform named entity recognition ☆136 Updated 8 years ago
- Java library for reading and writing WARC files with a typed API ☆48 Updated this week
- Java/JNI bindings to libpostal for fast international street address parsing/normalization ☆106 Updated 7 months ago
- Norconex Filesystem Collector is a flexible crawler for collecting, parsing, and manipulating data ranging from local hard drives to netw… ☆22 Updated last month
- Browser-driven explorer for lucene indexes ☆74 Updated 3 years ago
- The Sweble Wikitext Components module provides a parser for MediaWiki's wikitext and an engine trying to emulate the behavior of a MediaW… ☆70 Updated 7 months ago
- An open-source data management platform for knowledge workers (https://github.com/dswarm/dswarm-documentation/wiki) ☆54 Updated 6 years ago
- Mireka mail server and SMTP proxy ☆40 Updated 3 years ago
- Common web archive utility code. ☆50 Updated last month
- Implementation of Norconex Committer for Elasticsearch. ☆11 Updated 2 years ago
- 📘 A Citation Style Language (CSL) processor for Java. ☆89 Updated 4 months ago
- Java library for parsing semi-structured text files ☆64 Updated 3 years ago
- WARC (Web Archive) Input and Output Formats for Hadoop ☆35 Updated 9 years ago
- API definition, resources and reference implementation of URL Frontiers ☆46 Updated this week
- Solr Query Segmenter for structuring unstructured queries ☆21 Updated 3 years ago
- Open Source, Distributed, Big Data Enterprise Search Engine ☆69 Updated last week
- A port of the arclabs 'readability' package to Java ☆72 Updated 12 years ago
- Official Java implementation of the Matomo Tracking HTTP API. ☆69 Updated this week
- Query preprocessor for Java-based search engines (Querqy Core and Solr implementation) ☆183 Updated this week
- A text tagger based on Lucene / Solr, using FST technology ☆176 Updated 11 months ago
- Repackaging of Boilerpipe published on Maven Central Repository. ☆53 Updated 11 months ago
- Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark. ☆410 Updated last year
- A bundle of useful Elasticsearch plugins ☆110 Updated 7 months ago
- Migrate a Solr node to an Elasticsearch index. ☆55 Updated last year
- Java access to Neo4J graph databases at multiple levels of abstraction ☆86 Updated 4 years ago