apache / incubator-stormcrawler
A scalable, mature and versatile web crawler based on Apache Storm
☆901 Updated this week
Alternatives and similar repositories for incubator-stormcrawler:
Users who are interested in incubator-stormcrawler are comparing it to the libraries listed below.
- Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark. ☆412 Updated last year
- Apache Nutch is an extensible and scalable web crawler ☆2,971 Updated last month
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆213 Updated 2 years ago
- Banana for Solr - A Port of Kibana ☆669 Updated 6 months ago
- Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or fi… ☆186 Updated this week
- An Elasticsearch ingest processor to do named entity extraction using Apache OpenNLP ☆270 Updated 2 years ago
- A text tagger based on Lucene / Solr, using FST technology ☆176 Updated last year
- Fast Parallel Async HTTP client as a Service to monitor and manage 10,000 web servers. (Java+Akka) ☆898 Updated 7 years ago
- The Apache Gora open source framework provides an in-memory data model and persistence for big data. ☆121 Updated 11 months ago
- A java library for stored queries ☆375 Updated last year
- Apache OpenNLP ☆1,471 Updated last week
- Html Content / Article Extractor in Scala - open sourced from Gravity Labs - http://gravity.com ☆343 Updated 5 years ago
- Tools for reading data from Solr as a Spark RDD and indexing objects from Spark into Solr using SolrJ. ☆446 Updated last year
- Carrot2 plugin for ElasticSearch ☆292 Updated 2 years ago
- A repository of information, examples and good practices around the Lambda Architecture ☆368 Updated 7 years ago
- Query preprocessor for Java-based search engines (Querqy Core and Solr implementation) ☆183 Updated this week
- Mirror of Apache Samza ☆821 Updated 2 months ago
- Netflix's distributed Data Pipeline ☆795 Updated last year
- Apache Geode ☆2,299 Updated last month
- The LAW next generation crawler. ☆87 Updated 3 years ago
- NER toolkit for HTML data ☆259 Updated 9 months ago
- Carrot2: Text Clustering Algorithms and Applications ☆793 Updated 4 months ago
- The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). ☆2,757 Updated this week
- Python interface to Boilerpipe, Boilerplate Removal and Fulltext Extraction from HTML pages ☆543 Updated 3 years ago
- Divolte Collector ☆281 Updated 3 years ago
- Content Based Image Retrieval Plugin for Elasticsearch. It allows users to index images and search for similar images. ☆408 Updated 8 years ago
- Work in progress transmit from Google Code ☆1,114 Updated 7 years ago
- Fault tolerant job scheduler for Mesos which handles dependencies and ISO8601 based schedules ☆4,386 Updated 2 years ago
- HBase as a TinkerPop Graph Database ☆256 Updated this week
- Score documents with pure dot product / cosine similarity with ES ☆250 Updated 3 years ago