apache / stormcrawler
A scalable, mature and versatile web crawler based on Apache Storm
☆919Updated this week
Alternatives and similar repositories for stormcrawler
Users who are interested in stormcrawler are comparing it to the libraries listed below.
- Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.☆416Updated 2 years ago
- A set of reusable Java components that implement functionality common to any web crawler☆244Updated last week
- This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.☆1,208Updated last year
- A scalable frontier for web crawlers☆1,312Updated 2 weeks ago
- Work in progress transmit from Google Code☆1,116Updated 7 years ago
- Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or fi…☆190Updated 3 weeks ago
- Crawljax☆526Updated last year
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012)☆216Updated 2 years ago
- A Scrapy pipeline which sends items to an Elasticsearch server☆326Updated 3 years ago
- A curated list of Awesome Apache Solr links and resources.☆109Updated 3 years ago
- Apache OpenNLP☆1,516Updated this week
- Banana for Solr - A Port of Kibana☆671Updated 10 months ago
- Query preprocessor for Java-based search engines (Querqy Core and Solr implementation)☆184Updated 3 weeks ago
- Ghost Driver is an implementation of the Remote WebDriver Wire protocol, using PhantomJS as back-end☆1,910Updated 6 years ago
- NER toolkit for HTML data☆259Updated last year
- Scrapy spider middleware to ignore requests to pages containing items seen in previous crawls☆273Updated 3 months ago
- Carrot2 plugin for ElasticSearch☆291Updated 2 years ago
- A text tagger based on Lucene / Solr, using FST technology☆176Updated last year
- Open-source Enterprise Grade Search Engine Software☆507Updated 2 years ago
- A distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, orga…☆2,239Updated last month
- Content Based Image Retrieval Plugin for Elasticsearch. It allows users to index images and search for similar images.☆408Updated 8 years ago
- Mahout Taste-based recommendation on Elasticsearch☆335Updated 5 years ago
- Elasticsearch real-time search and analytics natively integrated with Hadoop☆1,942Updated this week
- Fast Parallel Async HTTP client as a Service to monitor and manage 10,000 web servers. (Java+Akka)☆900Updated 8 years ago
- Duke is a fast and flexible deduplication engine written in Java☆623Updated last year
- Web Content Extraction Through Machine Learning☆185Updated 11 years ago
- A plugin for language detection in Elasticsearch using Nakatani Shuyo's language detector☆252Updated 7 years ago
- Entity resolution for Elasticsearch.☆160Updated 5 months ago
- An Elasticsearch ingest processor to do named entity extraction using Apache OpenNLP☆272Updated 2 years ago
- Extract embedded metadata from HTML markup☆919Updated 3 months ago