apache / stormcrawler
A scalable, mature and versatile web crawler based on Apache Storm
☆933 · Updated last week
Alternatives and similar repositories for stormcrawler
Users who are interested in stormcrawler are comparing it to the libraries listed below.
- Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark. ☆418 · Updated 2 years ago
- A set of reusable Java components that implement functionality common to any web crawler ☆246 · Updated 2 weeks ago
- Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or fi… ☆194 · Updated last week
- Apache Nutch is an extensible and scalable web crawler ☆3,074 · Updated 2 weeks ago
- An Elasticsearch ingest processor to do named entity extraction using Apache OpenNLP ☆274 · Updated 2 years ago
- ACHE is a web crawler for domain-specific search. ☆473 · Updated last month
- Carrot2 plugin for ElasticSearch ☆291 · Updated 2 years ago
- Just the facts -- web page content extraction ☆1,273 · Updated 3 months ago
- Banana for Solr - A Port of Kibana ☆672 · Updated 2 months ago
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆220 · Updated 2 years ago
- Carrot2: Text Clustering Algorithms and Applications ☆830 · Updated 2 weeks ago
- Language Detection Library for Java ☆582 · Updated 3 years ago
- Elasticsearch File System Crawler (FS Crawler) ☆1,410 · Updated this week
- Apache OpenNLP ☆1,544 · Updated last week
- Duke is a fast and flexible deduplication engine written in Java ☆625 · Updated last year
- Mapper Attachments Type plugin for Elasticsearch ☆505 · Updated 2 years ago
- Extract embedded metadata from HTML markup ☆932 · Updated this week
- Query preprocessor for Java-based search engines (Querqy Core and Lucene implementation) ☆187 · Updated last week
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop ☆57 · Updated 4 years ago
- API definition, resources and reference implementation of URL Frontiers ☆52 · Updated this week
- A Java library for stored queries ☆378 · Updated 2 years ago
- A Java library to detect and normalize URLs in text ☆783 · Updated 2 months ago
- Datumbox is an open-source Machine Learning framework written in Java which allows the rapid development of Machine Learning and Statisti… ☆1,088 · Updated last year
- Tools for reading data from Solr as a Spark RDD and indexing objects from Spark into Solr using SolrJ. ☆445 · Updated last month
- Behemoth is an open source platform for large scale document analysis based on Apache Hadoop. ☆283 · Updated 7 years ago
- Migrate a Solr node to an Elasticsearch index. ☆56 · Updated 2 years ago
- Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. ☆3,068 · Updated last week
- ☆66 · Updated 8 years ago
- The Apache Gora open source framework provides an in-memory data model and persistence for big data. ☆122 · Updated last year
- News crawling with StormCrawler - stores content as WARC ☆356 · Updated 7 months ago