commoncrawl / commoncrawl-crawler
The Common Crawl Crawler Engine and Related MapReduce code (2008-2012)
☆216 Updated 2 years ago
Alternatives and similar repositories for commoncrawl-crawler
Users who are interested in commoncrawl-crawler are comparing it to the libraries listed below.
- Behemoth is an open source platform for large scale document analysis based on Apache Hadoop. ☆282 Updated 7 years ago
- Set of real-time stream processing algorithms that can be used by big data streaming platforms ☆72 Updated 4 years ago
- HBase as the backing store for the TF-IDF representations for Lucene ☆108 Updated 15 years ago
- WARC (Web Archive) Input and Output Formats for Hadoop ☆35 Updated 10 years ago
- SAMOA (Scalable Advanced Massive Online Analysis) is an open-source platform for mining big data streams. ☆425 Updated 9 years ago
- Apache OpenNLP Sandbox ☆43 Updated this week
- A Text Classification API in Java originally developed by DigitalPebble Ltd. The API is independent from the ML implementations used and … ☆48 Updated 3 years ago
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop ☆56 Updated 4 years ago
- The next generation of open source search ☆92 Updated 8 years ago
- ☆18 Updated 8 years ago
- Solr Dictionary Annotator (Microservice for Spark) ☆71 Updated 5 years ago
- A Real-Time Analytical Processing (RTAP) example using Spark/Shark ☆51 Updated 11 years ago
- Educational Example of a custom Lucene Query & Scorer ☆48 Updated 5 years ago
- Mirror of Apache Lens ☆60 Updated 5 years ago
- Katta - distributed Lucene ☆60 Updated 12 years ago
- SIREn - Semi-Structured Information Retrieval Engine ☆107 Updated 4 years ago
- Storm / Solr Integration ☆19 Updated last year
- Custom graph algorithms for Neo4j with own Java and REST APIs ☆35 Updated 8 years ago
- distributed realtime searchable database ☆117 Updated 10 years ago
- NLP tools developed by Emory University. ☆60 Updated 8 years ago
- A text tagger based on Lucene / Solr, using FST technology ☆176 Updated last year
- Distributed processing framework for search solutions ☆81 Updated 2 years ago
- Storm JMS Integration ☆78 Updated 2 years ago
- The Cognitive Foundry is an open-source Java library for building intelligent systems using machine learning ☆134 Updated 4 years ago
- Apache Joshua ☆107 Updated 4 years ago
- A set of reusable Java components that implement functionality common to any web crawler ☆244 Updated 2 weeks ago
- Set of Hadoop, Spark and Storm based tools for web and customer analytics ☆34 Updated 4 years ago
- Graph Processing Algorithms on top of Neo4j ☆39 Updated 7 years ago
- Using latent Dirichlet allocation (LDA) in Apache Lucene ☆58 Updated 12 years ago
- Trident-ML: A realtime online machine learning library ☆382 Updated last year