apache / stormcrawler
A scalable, mature and versatile web crawler based on Apache Storm
☆928 Updated last week
Alternatives and similar repositories for stormcrawler
Users who are interested in stormcrawler compare it to the libraries listed below.
- Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark. ☆417 Updated 2 years ago
- A set of reusable Java components that implement functionality common to any web crawler ☆246 Updated last week
- Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or fi… ☆194 Updated this week
- A scalable frontier for web crawlers ☆1,313 Updated 2 months ago
- Work in progress transmit from Google Code ☆1,122 Updated 7 years ago
- Open-source Enterprise Grade Search Engine Software ☆509 Updated 2 years ago
- An Elasticsearch ingest processor to do named entity extraction using Apache OpenNLP ☆274 Updated 2 years ago
- This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster. ☆1,215 Updated last year
- Banana for Solr - A Port of Kibana ☆671 Updated 3 weeks ago
- API definition, resources and reference implementation of URL Frontiers ☆52 Updated last month
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆218 Updated 2 years ago
- Apache Nutch is an extensible and scalable web crawler ☆3,056 Updated last month
- Carrot2 plugin for ElasticSearch ☆291 Updated 2 years ago
- ACHE is a web crawler for domain-specific search. ☆471 Updated this week
- Carrot2: Text Clustering Algorithms and Applications ☆821 Updated last week
- Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. ☆3,031 Updated last week
- A scrapy pipeline which sends items to Elastic Search server ☆325 Updated 3 years ago
- Tools for reading data from Solr as a Spark RDD and indexing objects from Spark into Solr using SolrJ. ☆445 Updated last year
- Crawljax ☆527 Updated last year
- Extract embedded metadata from HTML markup ☆928 Updated 5 months ago
- Solr query parser plugin that performs proper query-time synonym expansion. ☆150 Updated 4 years ago
- Datumbox is an open-source Machine Learning framework written in Java which allows the rapid development of Machine Learning and Statisti… ☆1,088 Updated last year
- The Apache Gora open source framework provides an in-memory data model and persistence for big data. ☆122 Updated last year
- Html Content / Article Extractor in Scala - open sourced from Gravity Labs - http://gravity.com ☆343 Updated 6 years ago
- Language Detection Library for Java ☆581 Updated 3 years ago
- Dice Solr Plugins from Simon Hughes Dice.com ☆88 Updated 4 years ago
- Just the facts -- web page content extraction ☆1,271 Updated last month
- Apache OpenNLP ☆1,532 Updated this week
- A text tagger based on Lucene / Solr, using FST technology ☆177 Updated last year
- A curated list of Awesome Apache Solr links and resources. ☆109 Updated 3 years ago