commoncrawl / commoncrawl-crawler
The Common Crawl Crawler Engine and Related MapReduce code (2008-2012)
☆222 Updated 2 years ago
Alternatives and similar repositories for commoncrawl-crawler
Users who are interested in commoncrawl-crawler are comparing it to the libraries listed below.
- A set of reusable Java components that implement functionality common to any web crawler ☆246 Updated last month
- The Apache Gora open source framework provides an in-memory data model and persistence for big data. ☆121 Updated last year
- Elasticsearch plugin for b-bit minhash algorithm ☆62 Updated last year
- Sample code, data, and configuration for the book ☆188 Updated 4 years ago
- Machine learning components for Apache UIMA ☆131 Updated 2 years ago
- Additional opennlp mapping type for elasticsearch in order to perform named entity recognition ☆136 Updated 9 years ago
- Set of real time stream processing algorithms that can be used by big data streaming platform ☆72 Updated 3 months ago
- Behemoth is an open source platform for large scale document analysis based on Apache Hadoop. ☆283 Updated 7 years ago
- Storm / Solr Integration ☆19 Updated last year
- A Java library implementing practical nearest neighbour search algorithm for multidimensional vectors that operates in sublinear time. It… ☆201 Updated 5 years ago
- command line tool for Apache Lucene ☆163 Updated 3 months ago
- API Hub is a web UI for browsing and searching a catalog of Rest.li APIs. ☆74 Updated 6 years ago
- This provides tools for b-bit MinHash algorithm. ☆36 Updated 5 months ago
- Lucene Auto Phrase TokenFilter implementation ☆59 Updated 7 years ago
- JAVA implementation of Multinomial Naive Bayes Text Classifier. ☆96 Updated 11 years ago
- Building recommenders with Elastic Graph! ☆37 Updated 5 years ago
- Distributed Realtime Search with Lucene and MongoDB ☆60 Updated 7 years ago
- SAMOA (Scalable Advanced Massive Online Analysis) is an open-source platform for mining big data streams. ☆427 Updated 9 years ago
- Elasticsearch Index Termlist ☆118 Updated 6 years ago
- Carrot2 plugin for ElasticSearch ☆291 Updated 2 years ago
- Katta - distributed Lucene ☆60 Updated 12 years ago
- Distributed processing framework for search solutions ☆82 Updated 2 years ago
- A Text Classification API in Java originally developed by DigitalPebble Ltd. The API is independent from the ML implementations used and … ☆48 Updated 4 years ago
- Customer Product search clicks analytics using big data Hadoop, Hive, Oozie, ElasticSearch, Akka, Spring Data ☆73 Updated 3 years ago
- ☆265 Updated 9 years ago
- A Real-Time Analytical Processing (RTAP) example using Spark/Shark ☆51 Updated 11 years ago
- Mirror of Apache Lens ☆62 Updated 5 years ago
- Skywalker for Elasticsearch is like Luke for Lucene ☆79 Updated 5 years ago
- distributed realtime searchable database ☆117 Updated 11 years ago
- Bloom filters for Java ☆66 Updated last year