Norconex / crawlers
Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or a filesystem and storing it in various data repositories, such as search engines.
☆188 · Updated this week
Alternatives and similar repositories for crawlers
Users who are interested in crawlers compare it to the libraries listed below.
- A set of reusable Java components that implement functionality common to any web crawler ☆244 · Updated this week
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆216 · Updated 2 years ago
- A scalable, mature and versatile web crawler based on Apache Storm ☆914 · Updated this week
- The Apache Gora open source framework provides an in-memory data model and persistence for big data. ☆121 · Updated last year
- ☆28 · Updated 8 years ago
- Carrot2 plugin for ElasticSearch ☆291 · Updated 2 years ago
- Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark. ☆416 · Updated 2 years ago
- High-security graph database ☆63 · Updated 2 years ago
- command line tool for Apache Lucene ☆162 · Updated 2 months ago
- The LAW next generation crawler. ☆87 · Updated 3 years ago
- A text tagger based on Lucene / Solr, using FST technology ☆176 · Updated last year
- Behemoth is an open source platform for large scale document analysis based on Apache Hadoop. ☆282 · Updated 7 years ago
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop ☆56 · Updated 4 years ago
- Crawl-Anywhere - Web Crawler and document processing pipeline with Solr integration. ☆96 · Updated 7 years ago
- Java access to Neo4J graph databases at multiple levels of abstraction ☆85 · Updated 4 years ago
- Mirror of Apache James Mime4j ☆54 · Updated 2 months ago
- Github mirror of "search/highlighter" - our actual code is hosted with Gerrit (please see https://www.mediawiki.org/wiki/Developer_access… ☆103 · Updated 2 weeks ago
- DKPro JWPL (DKPro Java Wikipedia Library) is a free, Java-based application programming interface that facilitates access to all informat… ☆86 · Updated this week
- Additional opennlp mapping type for elasticsearch in order to perform named entity recognition ☆136 · Updated 9 years ago
- Distributed processing framework for search solutions ☆81 · Updated 2 years ago
- Java library for parsing semi-structured text files ☆65 · Updated 3 years ago
- Java port of langid.py (language identifier) ☆28 · Updated 12 years ago
- Distributed Realtime Search with Lucene and MongoDB ☆59 · Updated 7 years ago
- Web Crawler for Elasticsearch ☆235 · Updated 5 years ago
- API definition, resources and reference implementation of URL Frontiers ☆48 · Updated last month
- Custom graph algorithms for Neo4j with own Java and REST APIs ☆35 · Updated 8 years ago
- Apache OpenNLP Sandbox ☆43 · Updated this week
- ACHE is a web crawler for domain-specific search. ☆468 · Updated last year
- The Sweble Wikitext Components module provides a parser for MediaWiki's wikitext and an engine trying to emulate the behavior of a MediaW… ☆72 · Updated last year
- Browser-driven explorer for lucene indexes ☆74 · Updated 3 years ago