commoncrawl / commoncrawl-crawler
The Common Crawl Crawler Engine and Related MapReduce code (2008-2012)
☆220Updated 2 years ago
Alternatives and similar repositories for commoncrawl-crawler
Users that are interested in commoncrawl-crawler are comparing it to the libraries listed below
- The Apache Gora open source framework provides an in-memory data model and persistence for big data.☆122Updated last year
- Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.☆283Updated 7 years ago
- Sample code, data, and configuration for the book☆188Updated 4 years ago
- Distributed Realtime Search with Lucene and MongoDB☆60Updated 7 years ago
- Set of real-time stream processing algorithms that can be used by big data streaming platforms☆72Updated 2 months ago
- Elasticsearch plugin for b-bit minhash algorithm☆63Updated last year
- Additional opennlp mapping type for elasticsearch in order to perform named entity recognition☆136Updated 9 years ago
- Java implementation of Multinomial Naive Bayes Text Classifier.☆96Updated 10 years ago
- A Text Classification API in Java originally developed by DigitalPebble Ltd. The API is independent from the ML implementations used and …☆48Updated 3 years ago
- Katta - distributed Lucene☆60Updated 12 years ago
- ☆265Updated 9 years ago
- A Real-Time Analytical Processing (RTAP) example using Spark/Shark☆51Updated 11 years ago
- A library for financial and time series calculations on Apache Spark☆28Updated 9 years ago
- HBase as the backing store for the TF-IDF representations for Lucene☆109Updated 15 years ago
- Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.☆417Updated 2 years ago
- Trident-ML : A realtime online machine learning library☆383Updated last year
- A set of reusable Java components that implement functionality common to any web crawler☆247Updated this week
- A text tagger based on Lucene / Solr, using FST technology☆177Updated last year
- Using latent Dirichlet allocation (LDA) in Apache Lucene☆58Updated 12 years ago
- Trooper is a Java-module like framework for building applications using one of the supported runtime profiles. Currently supported profil…☆43Updated 2 years ago
- Flowmix is a flexible event processing engine for Apache Storm. It supports complex correlations of events via sliding/tumbling windows. …☆59Updated 9 years ago
- Educational Example of a custom Lucene Query & Scorer☆48Updated 5 years ago
- complex event processing code☆104Updated 11 years ago
- command line tool for Apache Lucene☆163Updated 2 months ago
- Jetstream is a streaming processing framework☆114Updated 10 years ago
- Provides a SQL interface to your TinkerPop enabled graph db☆75Updated 2 years ago
- distributed realtime searchable database☆117Updated 11 years ago
- Storm / Solr Integration☆19Updated last year
- SAMOA (Scalable Advanced Massive Online Analysis) is an open-source platform for mining big data streams.☆427Updated 9 years ago
- SequenceIQ Hadoop examples☆115Updated 9 years ago