apache / stormcrawler
A scalable, mature and versatile web crawler based on Apache Storm
☆948 Updated this week
Alternatives and similar repositories for stormcrawler
Users who are interested in stormcrawler are comparing it to the libraries listed below.
- Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark. ☆418 Updated 2 years ago
- A set of reusable Java components that implement functionality common to any web crawler ☆247 Updated last week
- Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or fi… ☆196 Updated last week
- A scalable frontier for web crawlers ☆1,320 Updated 5 months ago
- Work in progress transmit from Google Code ☆1,125 Updated 7 years ago
- Banana for Solr - A Port of Kibana ☆672 Updated 3 months ago
- Apache Nutch is an extensible and scalable web crawler ☆3,089 Updated last week
- This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster. ☆1,219 Updated 2 years ago
- Crawljax ☆535 Updated 2 years ago
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆221 Updated 2 years ago
- Carrot2: Text Clustering Algorithms and Applications ☆835 Updated 3 weeks ago
- Just the facts -- web page content extraction ☆1,274 Updated 4 months ago
- A pure-python HTML screen-scraping library ☆1,887 Updated 3 years ago
- Extract embedded metadata from HTML markup ☆935 Updated last month
- Html Content / Article Extractor in Scala - open sourced from Gravity Labs - http://gravity.com ☆343 Updated 6 years ago
- Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. ☆3,089 Updated 2 weeks ago
- ☆28 Updated 9 years ago
- Language Detection Library for Java ☆585 Updated 3 years ago
- An Elasticsearch ingest processor to do named entity extraction using Apache OpenNLP ☆274 Updated 3 years ago
- The Apache Gora open source framework provides an in-memory data model and persistence for big data. ☆121 Updated last year
- Readability clone in Java ☆460 Updated 5 years ago
- Datumbox is an open-source Machine Learning framework written in Java which allows the rapid development of Machine Learning and Statisti… ☆1,087 Updated last year
- Apache OpenNLP ☆1,556 Updated last week
- Query preprocessor for Java-based search engines (Querqy Core and Lucene implementation) ☆188 Updated last week
- Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning ☆1,784 Updated 4 years ago
- Carrot2 plugin for ElasticSearch ☆291 Updated 2 years ago
- open source big data integration, analytics, and visualization ☆419 Updated 8 years ago
- Esper Complex Event Processing, Streaming SQL and Event Series Analysis ☆870 Updated last year
- A command-line tool for using CommonCrawl Index API at http://index.commoncrawl.org/ ☆205 Updated 7 years ago
- Mapper Attachments Type plugin for Elasticsearch ☆505 Updated 2 years ago