apache / incubator-stormcrawler
A scalable, mature and versatile web crawler based on Apache Storm
☆907 Updated this week
Alternatives and similar repositories for incubator-stormcrawler
Users who are interested in incubator-stormcrawler are comparing it to the libraries listed below.
- Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark. ☆415 Updated 2 years ago
- A set of reusable Java components that implement functionality common to any web crawler ☆244 Updated 3 weeks ago
- A scalable frontier for web crawlers ☆1,310 Updated 3 months ago
- Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or fi… ☆188 Updated this week
- Language Detection Library for Java ☆577 Updated 2 years ago
- This Scrapy project uses Redis and Kafka to create a distributed on-demand scraping cluster. ☆1,205 Updated last year
- Work in progress transmit from Google Code ☆1,114 Updated 7 years ago
- A Scrapy pipeline which sends items to an Elasticsearch server ☆328 Updated 2 years ago
- Mazerunner extends a Neo4j graph database to run scheduled big data graph compute algorithms at scale with HDFS and Apache Spark. ☆382 Updated 2 years ago
- Mapper Attachments Type plugin for Elasticsearch ☆504 Updated last year
- Banana for Solr - A Port of Kibana ☆670 Updated 9 months ago
- Apache OpenNLP ☆1,509 Updated this week
- Divolte Collector ☆281 Updated 3 years ago
- NER toolkit for HTML data ☆259 Updated last year
- A Java library for stored queries ☆375 Updated 2 years ago
- Crawljax ☆525 Updated last year
- A distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, orga… ☆2,238 Updated this week
- The Apache Gora open source framework provides an in-memory data model and persistence for big data. ☆121 Updated last year
- Readability clone in Java ☆459 Updated 4 years ago
- Open-source Enterprise Grade Search Engine Software ☆507 Updated 2 years ago
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆215 Updated 2 years ago
- Just the facts -- web page content extraction ☆1,265 Updated 10 months ago
- Elasticsearch entity resolution plugin based on Duke ☆210 Updated 4 years ago
- Automatically exported from code.google.com/p/chromium-compact-language-detector ☆162 Updated 4 years ago
- Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning ☆1,783 Updated 3 years ago
- Mirror of Apache Samza ☆825 Updated 2 weeks ago
- Fast Parallel Async HTTP client as a Service to monitor and manage 10,000 web servers. (Java+Akka) ☆900 Updated 8 years ago
- News crawling with StormCrawler - stores content as WARC ☆344 Updated 2 months ago
- Web Crawler for Elasticsearch ☆235 Updated 5 years ago
- Netflix's distributed Data Pipeline ☆796 Updated 2 years ago