apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm

☆907

Alternatives and similar repositories for incubator-stormcrawler

Users who are interested in incubator-stormcrawler are comparing it to the libraries listed below.

USCDataScience / sparkler
Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
☆415 Updated 2 years ago
crawler-commons / crawler-commons
A set of reusable Java components that implement functionality common to any web crawler
☆244 Updated 3 weeks ago
scrapinghub / frontera
A scalable frontier for web crawlers
☆1,310 Updated 3 months ago
Norconex / crawlers
Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or fi…
☆188 Updated this week
optimaize / language-detector
Language Detection Library for Java
☆577 Updated 2 years ago
istresearch / scrapy-cluster
This Scrapy project uses Redis and Kafka to create a distributed on-demand scraping cluster.
☆1,205 Updated last year
kohlschutter / boilerpipe
Work in progress transmit from Google Code
☆1,114 Updated 7 years ago
jayzeng / scrapy-elasticsearch
A Scrapy pipeline that sends items to an Elasticsearch server
☆328 Updated 2 years ago
neo4j-contrib / neo4j-mazerunner
Mazerunner extends a Neo4j graph database to run scheduled big data graph compute algorithms at scale with HDFS and Apache Spark.
☆382 Updated 2 years ago
elastic / elasticsearch-mapper-attachments
Mapper Attachments Type plugin for Elasticsearch
☆504 Updated last year
lucidworks / banana
Banana for Solr - A Port of Kibana
☆670 Updated 9 months ago
apache / opennlp
Apache OpenNLP
☆1,509 Updated this week
divolte / divolte-collector
Divolte Collector
☆281 Updated 3 years ago
scrapinghub / webstruct
NER toolkit for HTML data
☆259 Updated last year
flaxsearch / luwak
A Java library for stored queries
☆375 Updated 2 years ago
crawljax / crawljax
Crawljax
☆525 Updated last year
apache / gobblin
A distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, orga…
☆2,238 Updated this week
apache / gora
The Apache Gora open source framework provides an in-memory data model and persistence for big data.
☆121 Updated last year
karussell / snacktory
Readability clone in Java
☆459 Updated 4 years ago
jaeksoft / opensearchserver
Open-source Enterprise Grade Search Engine Software
☆507 Updated 2 years ago
commoncrawl / commoncrawl-crawler
The Common Crawl Crawler Engine and Related MapReduce code (2008-2012)
☆215 Updated 2 years ago
dragnet-org / dragnet
Just the facts -- web page content extraction
☆1,265 Updated 10 months ago
YannBrrd / elasticsearch-entity-resolution
Elasticsearch entity resolution plugin based on Duke
☆210 Updated 4 years ago
mikemccand / chromium-compact-language-detector
Automatically exported from code.google.com/p/chromium-compact-language-detector
☆162 Updated 4 years ago
OryxProject / oryx
Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning
☆1,783 Updated 3 years ago
apache / samza
Mirror of Apache Samza
☆825 Updated 2 weeks ago
eBay / restcommander
Fast Parallel Async HTTP client as a Service to monitor and manage 10,000 web servers. (Java+Akka)
☆900 Updated 8 years ago
commoncrawl / news-crawl
News crawling with StormCrawler - stores content as WARC
☆344 Updated 2 months ago
codelibs / elasticsearch-river-web
Web Crawler for Elasticsearch
☆235 Updated 5 years ago
Netflix / suro
Netflix's distributed Data Pipeline
☆796 Updated 2 years ago