apache / stormcrawler
A scalable, mature and versatile web crawler based on Apache Storm
☆908 Updated this week
Alternatives and similar repositories for stormcrawler
Users who are interested in stormcrawler are comparing it to the libraries listed below.
- Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark. ☆416 Updated 2 years ago
- A set of reusable Java components that implement functionality common to any web crawler ☆243 Updated last week
- Carrot2: Text Clustering Algorithms and Applications ☆808 Updated last week
- A scalable frontier for web crawlers ☆1,309 Updated this week
- This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster. ☆1,208 Updated last year
- Apache Nutch is an extensible and scalable web crawler ☆3,025 Updated 2 months ago
- Banana for Solr - A Port of Kibana ☆671 Updated 9 months ago
- An Elasticsearch ingest processor to do named entity extraction using Apache OpenNLP ☆272 Updated 2 years ago
- ACHE is a web crawler for domain-specific search. ☆468 Updated last year
- Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or fi… ☆188 Updated this week
- Crawljax ☆526 Updated last year
- A Java library for stored queries ☆375 Updated 2 years ago
- A Scrapy pipeline which sends items to an Elasticsearch server ☆327 Updated 2 years ago
- A JSON aware developer's interface to Elasticsearch. Comes with handy machinery such as syntax highlighting, autocomplete, formatting and… ☆383 Updated 4 months ago
- Elassandra = Elasticsearch + Apache Cassandra ☆1,717 Updated last week
- Carrot2 plugin for ElasticSearch ☆291 Updated 2 years ago
- Apache OpenNLP ☆1,513 Updated last week
- News crawling with StormCrawler - stores content as WARC ☆346 Updated 3 months ago
- Datumbox is an open-source Machine Learning framework written in Java which allows the rapid development of Machine Learning and Statisti… ☆1,086 Updated last year
- Html Content / Article Extractor in Scala - open sourced from Gravity Labs - http://gravity.com ☆343 Updated 5 years ago
- Elasticsearch/Solr Sandbox for exploring explain information and tweaking ☆137 Updated last year
- Work in progress transmit from Google Code ☆1,116 Updated 7 years ago
- Content Based Image Retrieval Plugin for Elasticsearch. It allows users to index images and search for similar images. ☆408 Updated 8 years ago
- Tools for reading data from Solr as a Spark RDD and indexing objects from Spark into Solr using SolrJ. ☆446 Updated last year
- Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning ☆1,784 Updated 3 years ago
- Query preprocessor for Java-based search engines (Querqy Core and Solr implementation) ☆184 Updated last week
- Distributed Graph Database ☆5,240 Updated 2 years ago
- Just the facts -- web page content extraction ☆1,266 Updated 11 months ago
- A text tagger based on Lucene / Solr, using FST technology ☆176 Updated last year
- Language Detection Library for Java ☆578 Updated 2 years ago