commoncrawl / commoncrawl-crawler
The Common Crawl Crawler Engine and Related MapReduce code (2008-2012)
☆216 Updated 2 years ago
Alternatives and similar repositories for commoncrawl-crawler
Users that are interested in commoncrawl-crawler are comparing it to the libraries listed below
- The Apache Gora open source framework provides an in-memory data model and persistence for big data. ☆121 Updated last year
- Set of real time stream processing algorithms that can be used by big data streaming platform ☆72 Updated 3 weeks ago
- Sample code, data, and configuration for the book ☆188 Updated 4 years ago
- Distributed Realtime Search with Lucene and MongoDB ☆59 Updated 7 years ago
- command line tool for Apache Lucene ☆163 Updated 3 weeks ago
- A set of reusable Java components that implement functionality common to any web crawler ☆244 Updated last week
- Katta - distributed Lucene ☆60 Updated 12 years ago
- Flowmix is a flexible event processing engine for Apache Storm. It supports complex correlations of events via sliding/tumbling windows. … ☆58 Updated 9 years ago
- Educational Example of a custom Lucene Query & Scorer ☆48 Updated 5 years ago
- A Java library implementing practical nearest neighbour search algorithm for multidimensional vectors that operates in sublinear time. It… ☆201 Updated 5 years ago
- Some example code of using Akka from Java ☆121 Updated 10 years ago
- Trident-ML : A realtime online machine learning library ☆382 Updated last year
- Graph Processing Algorithms on top of Neo4j ☆39 Updated 8 years ago
- Fast and efficient batch computation engine for complex analysis and reporting of massive datasets on Hadoop ☆244 Updated 9 years ago
- complex event processing code ☆103 Updated 11 years ago
- Jetstream is a streaming processing framework ☆113 Updated 9 years ago
- Customer Product search clicks analytics using big data Hadoop, Hive, Oozie, ElasticSearch, Akka, Spring Data ☆73 Updated 2 years ago
- A Real-Time Analytical Processing (RTAP) example using Spark/Shark ☆51 Updated 11 years ago
- A platform for visualization and real-time monitoring of data workflows ☆1,173 Updated 5 years ago
- Example application on how to use mongo-hadoop connector with Spark ☆91 Updated 11 years ago
- Lucene Auto Phrase TokenFilter implementation ☆59 Updated 7 years ago
- JAVA implementation of Multinomial Naive Bayes Text Classifier. ☆95 Updated 10 years ago
- Distributed processing framework for search solutions ☆81 Updated 2 years ago
- Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or fi… ☆192 Updated last week
- Storm / Solr Integration ☆19 Updated last year
- ☆263 Updated 9 years ago
- Behemoth is an open source platform for large scale document analysis based on Apache Hadoop. ☆282 Updated 7 years ago
- SequenceIQ Hadoop examples ☆115 Updated 9 years ago
- Mirror of Apache Lens ☆60 Updated 5 years ago
- Machine learning components for Apache UIMA ☆129 Updated 2 years ago