commoncrawl / commoncrawl-crawler
The Common Crawl Crawler Engine and Related MapReduce code (2008-2012)
☆212 Updated 2 years ago
Alternatives and similar repositories for commoncrawl-crawler:
Users who are interested in commoncrawl-crawler are comparing it to the libraries listed below:
- Elasticsearch Index Termlist ☆117 Updated 5 years ago
- API Hub is a web UI for browsing and searching a catalog of Rest.li APIs. ☆74 Updated 5 years ago
- Additional opennlp mapping type for elasticsearch in order to perform named entity recognition ☆136 Updated 8 years ago
- Behemoth is an open source platform for large scale document analysis based on Apache Hadoop. ☆281 Updated 6 years ago
- An efficient and flexible token-based regular expression language and engine. ☆75 Updated 10 years ago
- The Apache Gora open source framework provides an in-memory data model and persistence for big data. ☆120 Updated 11 months ago
- distributed realtime searchable database ☆116 Updated 10 years ago
- HBase as the backing store for the TF-IDF representations for Lucene ☆108 Updated 14 years ago
- A Real-Time Analytical Processing (RTAP) example using Spark/Shark ☆51 Updated 10 years ago
- Skywalker for Elasticsearch is like Luke for Lucene ☆79 Updated 4 years ago
- Solr Dictionary Annotator (Microservice for Spark) ☆71 Updated 4 years ago
- NLP tools developed by Emory University. ☆60 Updated 8 years ago
- Bixo is an open source web mining toolkit that runs as a series of Cascading pipes on top of Hadoop. By building a customized Cascading p… ☆142 Updated 2 years ago
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop ☆56 Updated 3 years ago
- Fusion demo app searching open-source project data from the Apache Software Foundation ☆42 Updated 6 years ago
- Lucene Auto Phrase TokenFilter implementation ☆59 Updated 6 years ago
- Set of real time stream processing algorithms that can be used by big data streaming platform ☆72 Updated 4 years ago
- SIREn - Semi-Structured Information Retrieval Engine ☆107 Updated 3 years ago
- A port of the arclabs 'readability' package to Java ☆72 Updated 12 years ago
- Katta - distributed Lucene ☆60 Updated 11 years ago
- A library for financial and time series calculations on Apache Spark ☆28 Updated 8 years ago
- Elasticsearch Latent Semantic Indexing experimentation ☆33 Updated 5 years ago
- Distributed processing framework for search solutions ☆81 Updated 2 years ago
- A Query Autofiltering SearchComponent for Solr that can translate free-text queries into structured queries using index metadata ☆28 Updated 6 years ago
- command line tool for Apache Lucene ☆160 Updated 5 months ago
- WARC (Web Archive) Input and Output Formats for Hadoop ☆35 Updated 10 years ago
- Using latent Dirichlet allocation (LDA) in Apache Lucene ☆58 Updated 12 years ago
- Machine learning components for Apache UIMA ☆129 Updated last year
- Elasticsearch plugin for b-bit minhash algorithm ☆62 Updated 7 months ago