apache / stormcrawler
A scalable, mature and versatile web crawler based on Apache Storm
☆932 Updated last week
Alternatives and similar repositories for stormcrawler
Users who are interested in stormcrawler are comparing it to the libraries listed below.
- Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark. ☆417 Updated 2 years ago
- A set of reusable Java components that implement functionality common to any web crawler ☆247 Updated this week
- Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or fi… ☆194 Updated last week
- Apache Nutch is an extensible and scalable web crawler ☆3,068 Updated this week
- ACHE is a web crawler for domain-specific search. ☆472 Updated 2 weeks ago
- Work in progress transmit from Google Code ☆1,122 Updated 7 years ago
- Banana for Solr - A Port of Kibana ☆671 Updated last month
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆220 Updated 2 years ago
- A scrapy pipeline which send items to Elastic Search server ☆325 Updated 3 years ago
- Carrot2: Text Clustering Algorithms and Applications ☆824 Updated last week
- Crawljax ☆530 Updated last year
- An Elasticsearch ingest processor to do named entity extraction using Apache OpenNLP ☆274 Updated 2 years ago
- Language Detection Library for Java ☆582 Updated 3 years ago
- Html Content / Article Extractor in Scala - open sourced from Gravity Labs - http://gravity.com ☆343 Updated 6 years ago
- The Apache Gora open source framework provides an in-memory data model and persistence for big data. ☆122 Updated last year
- ☆28 Updated 9 years ago
- open source big data integration, analytics, and visualization ☆421 Updated 8 years ago
- Carrot2 plugin for ElasticSearch ☆291 Updated 2 years ago
- Data Integration Graph ☆207 Updated 7 years ago
- A java library for stored queries ☆378 Updated 2 years ago
- Query preprocessor for Java-based search engines (Querqy Core and Lucene implementation) ☆187 Updated last week
- HBase as a TinkerPop Graph Database ☆261 Updated last week
- A curated list of Awesome Apache Solr links and resources. ☆109 Updated 3 years ago
- Apache OpenNLP ☆1,539 Updated this week
- Compact in-memory representation of directed graph data ☆563 Updated 2 years ago
- Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. ☆3,059 Updated last week
- Highly configurable recommender based on PredictionIO and Mahout's Correlated Cross-Occurrence algorithm ☆675 Updated 6 years ago
- A pure-python HTML screen-scraping library ☆1,883 Updated 3 years ago
- A distributed data integration framework that simplifies common aspects of big data integration such as data ingestion, replication, orga… ☆2,246 Updated last week
- Browser-driven explorer for lucene indexes ☆74 Updated 4 years ago