apache / incubator-stormcrawler
A scalable, mature and versatile web crawler based on Apache Storm
☆906 Updated last week
Alternatives and similar repositories for incubator-stormcrawler:
Users who are interested in incubator-stormcrawler are comparing it to the libraries listed below.
- Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark. ☆415 Updated 2 years ago
- A scalable frontier for web crawlers ☆1,309 Updated 2 months ago
- A set of reusable Java components that implement functionality common to any web crawler ☆243 Updated 3 weeks ago
- This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster. ☆1,203 Updated last year
- Apache Nutch is an extensible and scalable web crawler ☆3,003 Updated 3 weeks ago
- Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or fi… ☆188 Updated this week
- Carrot2 plugin for ElasticSearch ☆291 Updated 2 years ago
- Work in progress transmit from Google Code ☆1,114 Updated 7 years ago
- ACHE is a web crawler for domain-specific search. ☆468 Updated last year
- Web Content Extraction Through Machine Learning ☆185 Updated 11 years ago
- Apache OpenNLP ☆1,507 Updated this week
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆213 Updated 2 years ago
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop ☆56 Updated 3 years ago
- A Scrapy pipeline which sends items to an Elasticsearch server ☆328 Updated 2 years ago
- Tools for reading data from Solr as a Spark RDD and indexing objects from Spark into Solr using SolrJ. ☆446 Updated last year
- A text tagger based on Lucene / Solr, using FST technology ☆176 Updated last year
- Solr query parser plugin that performs proper query-time synonym expansion. ☆150 Updated 3 years ago
- An Elasticsearch ingest processor to do named entity extraction using Apache OpenNLP ☆271 Updated 2 years ago
- A Java library to detect and normalize URLs in text ☆782 Updated 2 years ago
- Open-source Enterprise Grade Search Engine Software ☆505 Updated 2 years ago
- Mahout Taste-based recommendation on Elasticsearch ☆335 Updated 5 years ago
- Readability clone in Java ☆459 Updated 4 years ago
- A Java library for stored queries ☆375 Updated 2 years ago
- Elasticsearch real-time search and analytics natively integrated with Hadoop ☆1,937 Updated last week
- A plugin for language detection in Elasticsearch using Nakatani Shuyo's language detector ☆252 Updated 7 years ago
- Divolte Collector ☆281 Updated 3 years ago
- HTTP API for Scrapy spiders ☆855 Updated 9 months ago
- Banana for Solr - A Port of Kibana ☆670 Updated 8 months ago
- Query preprocessor for Java-based search engines (Querqy Core and Solr implementation) ☆184 Updated last week
- A high performance "thin wrapper" HTTP REST server on top of Apache Lucene ☆143 Updated 11 months ago