apache / stormcrawler
A scalable, mature and versatile web crawler based on Apache Storm
☆921 Updated last week
Alternatives and similar repositories for stormcrawler
Users who are interested in stormcrawler are comparing it to the libraries listed below.
- Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark. ☆416 Updated 2 years ago
- A set of reusable Java components that implement functionality common to any web crawler ☆244 Updated last week
- Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or fi… ☆191 Updated this week
- Work in progress transmit from Google Code ☆1,117 Updated 7 years ago
- Banana for Solr - A Port of Kibana ☆671 Updated 11 months ago
- A Scrapy pipeline which sends items to an Elasticsearch server ☆326 Updated 3 years ago
- Carrot2: Text Clustering Algorithms and Applications ☆814 Updated last week
- Just the facts -- web page content extraction ☆1,268 Updated last week
- ACHE is a web crawler for domain-specific search. ☆469 Updated last year
- Apache Nutch is an extensible and scalable web crawler ☆3,043 Updated this week
- Language Detection Library for Java ☆581 Updated 2 years ago
- Crawljax ☆526 Updated last year
- Query preprocessor for Java-based search engines (Querqy Core and Solr implementation) ☆184 Updated last month
- An Elasticsearch ingest processor to do named entity extraction using Apache OpenNLP ☆272 Updated 2 years ago
- Carrot2 plugin for ElasticSearch ☆291 Updated 2 years ago
- Elassandra = Elasticsearch + Apache Cassandra ☆1,719 Updated last month
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆216 Updated 2 years ago
- HTML Content / Article Extractor in Scala - open sourced from Gravity Labs - http://gravity.com ☆343 Updated 5 years ago
- A Java library for stored queries ☆376 Updated 2 years ago
- A plugin for language detection in Elasticsearch using Nakatani Shuyo's language detector ☆252 Updated 7 years ago
- Dice Solr Plugins from Simon Hughes Dice.com ☆87 Updated 4 years ago
- Divolte Collector ☆281 Updated 3 years ago
- Mapper Attachments Type plugin for Elasticsearch ☆503 Updated 2 years ago
- The LAW next generation crawler. ☆87 Updated 3 years ago
- Tools for reading data from Solr as a Spark RDD and indexing objects from Spark into Solr using SolrJ. ☆446 Updated last year
- Netflix's distributed Data Pipeline ☆797 Updated 2 years ago
- Solr query parser plugin that performs proper query-time synonym expansion. ☆150 Updated 4 years ago
- Mahout Taste-based recommendation on Elasticsearch ☆335 Updated 5 years ago
- Elasticsearch/Solr Sandbox for exploring explain information and tweaking ☆137 Updated last year
- Duke is a fast and flexible deduplication engine written in Java ☆623 Updated last year