apache / stormcrawler
A scalable, mature and versatile web crawler based on Apache Storm
☆950 · Updated this week
Alternatives and similar repositories for stormcrawler
Users who are interested in stormcrawler compare it to the libraries listed below.
- Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark. ☆419 · Updated 2 years ago
- A set of reusable Java components that implement functionality common to any web crawler ☆250 · Updated this week
- Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or fi… ☆197 · Updated last week
- A scalable frontier for web crawlers ☆1,324 · Updated 6 months ago
- Carrot2 plugin for ElasticSearch ☆293 · Updated 2 years ago
- This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster. ☆1,220 · Updated 2 years ago
- ACHE is a web crawler for domain-specific search. ☆474 · Updated 3 months ago
- Crawljax ☆534 · Updated 2 years ago
- A scrapy pipeline which send items to Elastic Search server ☆323 · Updated 3 years ago
- Work in progress transmit from Google Code ☆1,126 · Updated 7 years ago
- Banana for Solr - A Port of Kibana ☆672 · Updated 4 months ago
- An Elasticsearch ingest processor to do named entity extraction using Apache OpenNLP ☆275 · Updated 3 years ago
- Elassandra = Elasticsearch + Apache Cassandra ☆1,719 · Updated 6 months ago
- Carrot2: Text Clustering Algorithms and Applications ☆836 · Updated 2 weeks ago
- Apache Nutch is an extensible and scalable web crawler ☆3,095 · Updated 2 weeks ago
- Score documents with pure dot product / cosine similarity with ES ☆254 · Updated 4 years ago
- Query preprocessor for Java-based search engines (Querqy Core and Lucene implementation) ☆189 · Updated last week
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆223 · Updated 2 years ago
- Language Detection Library for Java ☆585 · Updated 3 years ago
- Mirror of Apache Samza ☆833 · Updated 7 months ago
- Plugin to integrate Learning to Rank (aka machine learning for better relevance) with Elasticsearch ☆1,519 · Updated last month
- Just the facts -- web page content extraction ☆1,276 · Updated 5 months ago
- A java library for stored queries ☆377 · Updated 2 years ago
- Project SnappyData - memory optimized analytics database, based on Apache Spark™ and Apache Geode™. Stream, Transact, Analyze, Predict in… ☆1,035 · Updated 3 years ago
- Html Content / Article Extractor in Scala - open sourced from Gravity Labs - http://gravity.com ☆343 · Updated 6 years ago
- Tools for reading data from Solr as a Spark RDD and indexing objects from Spark into Solr using SolrJ. ☆446 · Updated 3 months ago
- Mahout Taste-based recommendation on Elasticsearch ☆334 · Updated 6 years ago
- Extract embedded metadata from HTML markup ☆934 · Updated 2 months ago
- A plugin for language detection in Elasticsearch using Nakatani Shuyo's language detector ☆251 · Updated 8 years ago
- Datumbox is an open-source Machine Learning framework written in Java which allows the rapid development of Machine Learning and Statisti… ☆1,086 · Updated 2 years ago