apache / incubator-stormcrawler
A scalable, mature and versatile web crawler based on Apache Storm
☆891 Updated this week
Related projects
Alternatives and complementary repositories for incubator-stormcrawler
- Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark. ☆410 Updated last year
- A set of reusable Java components that implement functionality common to any web crawler ☆237 Updated this week
- A scalable frontier for web crawlers ☆1,302 Updated last year
- Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or fi… ☆183 Updated this week
- This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster. ☆1,182 Updated last year
- Apache Nutch is an extensible and scalable web crawler ☆2,923 Updated 3 weeks ago
- Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. ☆2,837 Updated this week
- Open-source Enterprise Grade Search Engine Software ☆500 Updated 2 years ago
- Mirror of Apache Samza ☆820 Updated last month
- Just the facts -- web page content extraction ☆1,254 Updated 4 months ago
- A Scrapy pipeline which sends items to an Elasticsearch server ☆327 Updated 2 years ago
- Behemoth is an open source platform for large scale document analysis based on Apache Hadoop. ☆281 Updated 6 years ago
- News crawling with StormCrawler - stores content as WARC ☆322 Updated 11 months ago
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆214 Updated last year
- Elasticsearch real-time search and analytics natively integrated with Hadoop ☆9 Updated last week
- Banana for Solr - A Port of Kibana ☆668 Updated 3 months ago
- Readability clone in Java ☆461 Updated 4 years ago
- Work in progress transmit from Google Code ☆1,109 Updated 6 years ago
- A distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, orga… ☆2,230 Updated this week
- Apache Drill is a distributed MPP query layer for self describing data ☆1,948 Updated this week
- Tools for reading data from Solr as a Spark RDD and indexing objects from Spark into Solr using SolrJ. ☆445 Updated 11 months ago
- ACHE is a web crawler for domain-specific search. ☆454 Updated last year
- A command-line tool for using CommonCrawl Index API at http://index.commoncrawl.org/ ☆181 Updated 6 years ago
- Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning ☆1,788 Updated 3 years ago
- Additional opennlp mapping type for elasticsearch in order to perform named entity recognition ☆136 Updated 8 years ago
- A java library for stored queries ☆374 Updated last year
- Command line client for Scrapyd server ☆770 Updated last month
- NER toolkit for HTML data ☆257 Updated 6 months ago
- Article extraction benchmark: dataset and evaluation scripts ☆289 Updated 6 months ago
- The Apache Gora open source framework provides an in-memory data model and persistence for big data. ☆119 Updated 8 months ago