commoncrawl / commoncrawl-crawler
The Common Crawl Crawler Engine and Related MapReduce code (2008-2012)
☆213 Updated 2 years ago
Alternatives and similar repositories for commoncrawl-crawler:
Users who are interested in commoncrawl-crawler are comparing it to the libraries listed below.
- WARC (Web Archive) Input and Output Formats for Hadoop ☆35 Updated 10 years ago
- distributed realtime searchable database ☆117 Updated 10 years ago
- Storm / Solr Integration ☆19 Updated last year
- The Apache Gora open source framework provides an in-memory data model and persistence for big data. ☆121 Updated last year
- A Real-Time Analytical Processing (RTAP) example using Spark/Shark ☆51 Updated 11 years ago
- Katta - distributed Lucene ☆60 Updated 12 years ago
- Behemoth is an open source platform for large scale document analysis based on Apache Hadoop. ☆281 Updated 7 years ago
- API Hub is a web UI for browsing and searching a catalog of Rest.li APIs. ☆74 Updated 5 years ago
- A set of reusable Java components that implement functionality common to any web crawler ☆244 Updated 3 weeks ago
- Jetstream Esper Processor implementation ☆23 Updated 9 years ago
- SIREn - Semi-Structured Information Retrieval Engine ☆107 Updated 3 years ago
- Machine learning components for Apache UIMA ☆129 Updated last year
- A library for financial and time series calculations on Apache Spark ☆28 Updated 9 years ago
- Bixo is an open source web mining toolkit that runs as a series of Cascading pipes on top of Hadoop. By building a customized Cascading p… ☆142 Updated 2 years ago
- Analyzing Twitter real time feed with Spark Streaming ☆32 Updated 10 years ago
- Additional opennlp mapping type for elasticsearch in order to perform named entity recognition ☆136 Updated 8 years ago
- Set of real time stream processing algorithms that can be used by big data streaming platform ☆72 Updated 4 years ago
- A Stanford CoreNLP server, with example clients, using Apache Thrift. ☆47 Updated 6 years ago
- Repackaging of Boilerpipe published on Maven Central Repository. ☆53 Updated last year
- NLP tools developed by Emory University. ☆60 Updated 8 years ago
- ☆264 Updated 9 years ago
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop ☆56 Updated 3 years ago
- Educational Example of a custom Lucene Query & Scorer ☆48 Updated 5 years ago
- Mirror of Apache DirectMemory ☆52 Updated last year
- Set of Hadoop, Spark and Storm based tools for web and customer analytic ☆34 Updated 3 years ago
- Spark CEP is an extension of Spark Streaming to support SQL-based query processing ☆56 Updated 8 years ago
- Skywalker for Elasticsearch is like Luke for Lucene ☆79 Updated 5 years ago
- Lucene Auto Phrase TokenFilter implementation ☆59 Updated 6 years ago
- Analytic UIMA pipelines using Spark ☆23 Updated 9 years ago
- command line tool for Apache Lucene ☆162 Updated 3 weeks ago