The Common Crawl Crawler Engine and Related MapReduce code (2008-2012)
☆224Dec 22, 2022Updated 3 years ago
Alternatives and similar repositories for commoncrawl-crawler
Users who are interested in commoncrawl-crawler are comparing it to the repositories listed below
- FoGFaaS: Add serverless computing (FaaS) to iFogSim☆22Mar 30, 2025Updated 11 months ago
- Bixo is an open source web mining toolkit that runs as a series of Cascading pipes on top of Hadoop. By building a customized Cascading p…☆142Jul 7, 2022Updated 3 years ago
- Simple Samza Job Using Confluent Platform☆14Apr 14, 2016Updated 9 years ago
- Maven version of DistributeCrawler☆10Jun 20, 2022Updated 3 years ago
- Blog crawler for the blogforever project.☆23Jan 31, 2014Updated 12 years ago
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop☆38Mar 12, 2026Updated last week
- A set of reusable Java components that implement functionality common to any web crawler☆254Feb 26, 2026Updated 3 weeks ago
- News crawling with StormCrawler - stores content as WARC☆364Feb 19, 2025Updated last year
- SKOS Support for Apache Lucene and Solr☆56May 12, 2021Updated 4 years ago
- Graph algorithms implemented in GraphX and Spark styles☆15Apr 26, 2015Updated 10 years ago
- Rails helpers for outputting preloading/prefetching metadata.☆19Jul 8, 2018Updated 7 years ago
- Hadoop jobs for WikiReverse project. Parses Common Crawl data for links to Wikipedia articles.☆38Aug 12, 2018Updated 7 years ago
- Distributed Realtime Search with Lucene and MongoDB☆60May 14, 2018Updated 7 years ago
- A fork of cascading patterns, but implemented for trident☆71Dec 16, 2023Updated 2 years ago
- Camus Compressor merges files created by Camus and saves them in a compressed format.☆13Mar 20, 2023Updated 3 years ago
- WikiPBX is an open source PBX web interface for FreeSWITCH. WikiPBX is written in python and uses the Django web application framework. C…☆10May 30, 2014Updated 11 years ago
- A distributed key-value storage system built over RocksDB☆15Dec 30, 2016Updated 9 years ago
- Source for Reactive Architecture: Beyond the Basics online training course☆16Jul 26, 2017Updated 8 years ago
- A project meant to be a reference to get started building indexes for Solr with Hadoop's map-reduce.☆50Nov 9, 2016Updated 9 years ago
- Collects multimedia content shared through social networks.☆19Feb 18, 2015Updated 11 years ago
- Static site generator for Linked Data☆12Jan 24, 2026Updated last month
- A library for financial and time series calculations on Apache Spark☆28Feb 2, 2016Updated 10 years ago
- Tool for visualizing Apache Oozie pipelines☆12Feb 15, 2016Updated 10 years ago
- PredictionIO word2vec engine template (Scala-based parallelized engine)☆12Apr 22, 2015Updated 10 years ago
- This is a basic instance of the D-Net software toolkit, a software framework for the realization of aggregative data infrastructures.☆15Jan 27, 2022Updated 4 years ago
- Collect and disseminate information on fee-based Open Access publishing in Sweden☆11Mar 13, 2026Updated last week
- A unitypackaged mirror of Moq, for use in Unity3D☆17Aug 28, 2013Updated 12 years ago
- The Scholix metadata schema is a set of properties describing a Link Information Package, which carries information about a link between …☆17Mar 14, 2022Updated 4 years ago
- Java EE Cache Filter☆36Mar 15, 2019Updated 7 years ago
- Collection of generic Apache Flink operators☆17May 15, 2017Updated 8 years ago
- The hub for all JATS4R meeting notes, examples, draft recommendations, documents, and issues.☆17Sep 8, 2019Updated 6 years ago
- A Java library for creating standalone, portable, schema-full object databases supporting pagination and faceted search, and offering str…☆17Mar 17, 2017Updated 9 years ago
- DATS JSON schemas☆13Dec 21, 2022Updated 3 years ago
- Start of an Internet draft on the separation between HTTP's semantic layer, framing layer(s), and the underlying transport layer.☆15Mar 22, 2016Updated 9 years ago
- An example implementation of React Native Vision Camera.☆15Mar 7, 2022Updated 4 years ago
- A distributed generic query layer for Apache Kafka Interactive Queries☆26Nov 8, 2017Updated 8 years ago
- Combine Flume, Spark Streaming, and Redis for real-time computing☆22Oct 20, 2014Updated 11 years ago
- Autoproxy automatically detects proxies and stores them in the respective environment variables (e.g. http_proxy).☆13Oct 2, 2016Updated 9 years ago
- Spring MVC + Mustache example☆15Jun 27, 2016Updated 9 years ago