commoncrawl / commoncrawl-crawler
The Common Crawl Crawler Engine and Related MapReduce code (2008-2012)
☆213 Updated 2 years ago
Alternatives and similar repositories for commoncrawl-crawler:
Users who are interested in commoncrawl-crawler are comparing it to the libraries listed below.
- Additional opennlp mapping type for elasticsearch in order to perform named entity recognition ☆136 Updated 8 years ago
- distributed realtime searchable database ☆117 Updated 10 years ago
- Jetstream is a streaming processing framework ☆113 Updated 9 years ago
- Katta - distributed Lucene ☆60 Updated 11 years ago
- Set of real time stream processing algorithms that can be used by big data streaming platform ☆72 Updated 4 years ago
- Elasticsearch Index Termlist ☆117 Updated 5 years ago
- Behemoth is an open source platform for large scale document analysis based on Apache Hadoop. ☆281 Updated 6 years ago
- ☆264 Updated 9 years ago
- Distributed Realtime Search with Lucene and MongoDB ☆59 Updated 6 years ago
- WARC (Web Archive) Input and Output Formats for Hadoop ☆35 Updated 10 years ago
- The Apache Gora open source framework provides an in-memory data model and persistence for big data. ☆121 Updated last year
- Github mirror of "search/extra" - our actual code is hosted with Gerrit (please see https://www.mediawiki.org/wiki/Developer_access for c… ☆53 Updated last month
- A Real-Time Analytical Processing (RTAP) example using Spark/Shark ☆51 Updated 11 years ago
- Skywalker for Elasticsearch is like Luke for Lucene ☆79 Updated 5 years ago
- HBase as the backing store for the TF-IDF representations for Lucene ☆108 Updated 14 years ago
- API Hub is a web UI for browsing and searching a catalog of Rest.li APIs. ☆74 Updated 5 years ago
- PredictionIO Java SDK ☆105 Updated 6 years ago
- Fast and efficient batch computation engine for complex analysis and reporting of massive datasets on Hadoop ☆243 Updated 9 years ago
- Fabric-based framework for deploying and managing SolrCloud clusters in the cloud. ☆90 Updated 6 years ago
- Elasticsearch plugin for b-bit minhash algorithm ☆62 Updated 9 months ago
- The next generation of open source search ☆91 Updated 7 years ago
- The Kiji project suite ☆33 Updated 9 years ago
- ☆28 Updated 8 years ago
- A port of the arclabs 'readability' package to Java ☆72 Updated 12 years ago
- A Text Classification API in Java originally developed by DigitalPebble Ltd. The API is independent from the ML implementations used and … ☆48 Updated 3 years ago
- Example application on how to use mongo-hadoop connector with Spark ☆90 Updated 11 years ago
- Trident-ML : A realtime online machine learning library ☆381 Updated last year
- Mensa is a generic, flexible, enhanced, and efficient Java implementation of a pattern matching state machine as described by the 1975 pa… ☆94 Updated 9 years ago
- NLP tools developed by Emory University. ☆60 Updated 8 years ago
- Solr Redis Extensions ☆52 Updated last year