commoncrawl / commoncrawl-crawler
The Common Crawl Crawler Engine and Related MapReduce code (2008-2012)
☆223 Updated 3 years ago
Alternatives and similar repositories for commoncrawl-crawler
Users who are interested in commoncrawl-crawler are comparing it to the libraries listed below.
- Behemoth is an open source platform for large scale document analysis based on Apache Hadoop. ☆283 Updated 7 years ago
- Katta - distributed Lucene ☆60 Updated 12 years ago
- A set of reusable Java components that implement functionality common to any web crawler ☆251 Updated last week
- The Apache Gora open source framework provides an in-memory data model and persistence for big data. ☆121 Updated last year
- Set of real time stream processing algorithms that can be used by big data streaming platform ☆73 Updated 6 months ago
- Storm / Solr Integration ☆19 Updated last year
- Machine learning components for Apache UIMA ☆132 Updated 2 years ago
- Sample code, data, and configuration for the book ☆189 Updated 4 years ago
- Additional opennlp mapping type for elasticsearch in order to perform named entity recognition ☆136 Updated 9 years ago
- distributed realtime searchable database ☆117 Updated 11 years ago
- API Hub is a web UI for browsing and searching a catalog of Rest.li APIs. ☆74 Updated 6 years ago
- Flowmix is a flexible event processing engine for Apache Storm. It supports complex correlations of events via sliding/tumbling windows. … ☆59 Updated 9 years ago
- Trident-ML : A realtime online machine learning library ☆384 Updated 2 years ago
- Mirror of Apache Lens ☆62 Updated 6 years ago
- A Real-Time Analytical Processing (RTAP) example using Spark/Shark ☆51 Updated 11 years ago
- A Query Autofiltering SearchComponent for Solr that can translate free-text queries into structured queries using index metadata ☆26 Updated 7 years ago
- Next-generation web analytics processing with Scala, Spark, and Parquet. ☆331 Updated 10 years ago
- SAMOA (Scalable Advanced Massive Online Analysis) is an open-source platform for mining big data streams. ☆428 Updated 9 years ago
- Educational Example of a custom Lucene Query & Scorer ☆48 Updated 5 years ago
- Distributed processing framework for search solutions ☆82 Updated 3 years ago
- Spark examples ☆41 Updated last year
- Solr Dictionary Annotator (Microservice for Spark) ☆71 Updated 5 years ago
- command line tool for Apache Lucene ☆164 Updated 3 weeks ago
- Fast and efficient batch computation engine for complex analysis and reporting of massive datasets on Hadoop ☆244 Updated 10 years ago
- Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark. ☆418 Updated 2 years ago
- Graph Processing Algorithms on top of Neo4j ☆39 Updated 8 years ago
- A Java library implementing practical nearest neighbour search algorithm for multidimensional vectors that operates in sublinear time. It… ☆202 Updated 5 years ago
- HBase as the backing store for the TF-IDF representations for Lucene ☆109 Updated 15 years ago
- The next generation of open source search ☆93 Updated 8 years ago
- Using latent Dirichlet allocation (LDA) in Apache Lucene ☆57 Updated 13 years ago