apache / incubator-stormcrawler
A scalable, mature and versatile web crawler based on Apache Storm
☆904 · Updated last week
Alternatives and similar repositories for incubator-stormcrawler:
Users that are interested in incubator-stormcrawler are comparing it to the libraries listed below
- Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark. ☆413 · Updated 2 years ago
- A set of reusable Java components that implement functionality common to any web crawler ☆243 · Updated last week
- Apache Nutch is an extensible and scalable web crawler ☆3,000 · Updated this week
- Elasticsearch real-time search and analytics natively integrated with Hadoop ☆1,935 · Updated this week
- ACHE is a web crawler for domain-specific search. ☆464 · Updated last year
- Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning ☆1,782 · Updated 3 years ago
- Elassandra = Elasticsearch + Apache Cassandra ☆1,714 · Updated last year
- A distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, orga… ☆2,238 · Updated last week
- Banana for Solr - A Port of Kibana ☆670 · Updated 7 months ago
- HBase as a TinkerPop Graph Database ☆256 · Updated last week
- A software library of stochastic streaming algorithms, a.k.a. sketches. ☆906 · Updated this week
- Tools for reading data from Solr as a Spark RDD and indexing objects from Spark into Solr using SolrJ. ☆446 · Updated last year
- Work in progress transmit from Google Code ☆1,114 · Updated 7 years ago
- Migrate a Solr node to an Elasticsearch index. ☆55 · Updated last year
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆213 · Updated 2 years ago
- Mirror of Apache Samza ☆824 · Updated 3 weeks ago
- Carrot2: Text Clustering Algorithms and Applications ☆799 · Updated last month
- Language Detection Library for Java ☆575 · Updated 2 years ago
- A Scrapy pipeline which sends items to an Elasticsearch server ☆328 · Updated 2 years ago
- A Java library to detect and normalize URLs in text ☆783 · Updated 2 years ago
- LinkedIn's previous generation Kafka to HDFS pipeline. ☆876 · Updated 4 years ago
- Highly configurable recommender based on PredictionIO and Mahout's Correlated Cross-Occurrence algorithm ☆671 · Updated 5 years ago
- Tranquility helps you send real-time event streams to Druid and handles partitioning, replication, service discovery, and schema rollover… ☆516 · Updated 5 years ago
- OpenRTB model for Java and other languages via protobuf; Helper OpenRTB libraries for Java including JSON serialization ☆400 · Updated last year
- Apache Drill is a distributed MPP query layer for self describing data ☆1,963 · Updated 2 weeks ago
- News crawling with StormCrawler - stores content as WARC ☆339 · Updated last month
- Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or fi… ☆187 · Updated this week
- Distributed Big Data Orchestration Service ☆1,725 · Updated last week
- Netflix's distributed Data Pipeline ☆793 · Updated last year
- An Elasticsearch ingest processor to do named entity extraction using Apache OpenNLP ☆270 · Updated 2 years ago