apache / stormcrawler
A scalable, mature and versatile web crawler based on Apache Storm
☆959Updated this week
Alternatives and similar repositories for stormcrawler
Users that are interested in stormcrawler are comparing it to the libraries listed below
- Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.☆419Updated 2 years ago
- A set of reusable Java components that implement functionality common to any web crawler☆251Updated 2 weeks ago
- Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or fi…☆196Updated this week
- A scalable frontier for web crawlers☆1,325Updated 8 months ago
- Carrot2 plugin for ElasticSearch☆294Updated 3 years ago
- Work in progress transmit from Google Code☆1,127Updated 8 years ago
- Plugin to integrate Learning to Rank (aka machine learning for better relevance) with Elasticsearch☆1,525Updated 3 months ago
- Carrot2: Text Clustering Algorithms and Applications☆845Updated 2 weeks ago
- An Elasticsearch ingest processor to do named entity extraction using Apache OpenNLP☆276Updated 3 years ago
- Banana for Solr - A Port of Kibana☆672Updated 6 months ago
- Open-source Enterprise Grade Search Engine Software☆512Updated 3 years ago
- API definition, resources and reference implementation of URL Frontiers☆57Updated 2 weeks ago
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012)☆222Updated 3 years ago
- This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.☆1,231Updated 2 years ago
- A scrapy pipeline which sends items to Elastic Search server☆322Updated 3 years ago
- Crawljax☆536Updated 2 years ago
- Just the facts -- web page content extraction☆1,280Updated 6 months ago
- ACHE is a web crawler for domain-specific search.☆479Updated 5 months ago
- Apache Nutch is an extensible and scalable web crawler☆3,119Updated 2 weeks ago
- Language Detection Library for Java☆585Updated 3 years ago
- Query preprocessor for Java-based search engines (Querqy Core and Lucene implementation)☆189Updated this week
- Tools for reading data from Solr as a Spark RDD and indexing objects from Spark into Solr using SolrJ.☆446Updated 5 months ago
- Extract embedded metadata from HTML markup☆943Updated 4 months ago
- Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning☆1,785Updated 4 years ago
- Elassandra = Elasticsearch + Apache Cassandra☆1,720Updated 8 months ago
- Elasticsearch/Solr Sandbox for exploring explain information and tweaking☆139Updated last year
- Elasticsearch plugin for nearest neighbor search. Store vectors and run similarity search using exact and approximate algorithms.☆391Updated last month
- A java library for stored queries☆378Updated 2 years ago
- Duke is a fast and flexible deduplication engine written in Java☆626Updated 2 years ago
- Apache OpenNLP☆1,578Updated last week