CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop
☆38 · Jun 30, 2026 · Updated this week
Alternatives and similar repositories for cc-warc-examples
Users that are interested in cc-warc-examples are comparing it to the libraries listed below.
- Launch AWS Elastic MapReduce jobs that process Common Crawl data. ☆49 · Feb 15, 2017 · Updated 9 years ago
- Common web archive utility code. ☆65 · Updated this week
- DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate… ☆53 · Jun 12, 2020 · Updated 6 years ago
- Set of scripts to aid in the download of the GDELT data files from www.gdeltproject.org ☆12 · May 17, 2014 · Updated 12 years ago
- Demonstration of using Python to process the Common Crawl dataset with the mrjob framework ☆168 · Jan 27, 2026 · Updated 5 months ago
- Events and Situations Ontology ☆14 · Apr 20, 2018 · Updated 8 years ago
- Hadoop jobs for WikiReverse project. Parses Common Crawl data for links to Wikipedia articles. ☆38 · Aug 12, 2018 · Updated 7 years ago
- Code for the paper Faster Phrase-Based Decoding by Refining Feature State ☆14 · Jan 9, 2023 · Updated 3 years ago
- Video streaming battery rundown test methodology ☆14 · Nov 6, 2019 · Updated 6 years ago
- Goblin OGM on top of TinkerPop 3 ☆11 · Jul 20, 2023 · Updated 2 years ago
- Several scripts to analyse Wikidata dumps ☆33 · Apr 7, 2014 · Updated 12 years ago
- Implementation of W3C's R2RML and Direct Mapping specifications ☆10 · Oct 12, 2020 · Updated 5 years ago
- A simple Node.js wrapper for the BitX API. ☆11 · Jun 23, 2022 · Updated 4 years ago
- XPath extension for extraction from interactive web sites. NOTE: This code is currently out of sync. A more recent, but precompiled versi… ☆25 · Feb 27, 2013 · Updated 13 years ago
- AWS Lambda layer containing latest version of Apache Tika ☆14 · Jun 19, 2026 · Updated 2 weeks ago
- ☆17 · Updated this week
- This repository contains a series of 4 jupyter notebooks demonstrating how AWS AI Services like Amazon Rekognition, Amazon Transcribe and… ☆12 · Nov 26, 2021 · Updated 4 years ago
- A library of examples showing how to use the Common Crawl corpus (2008-2012, ARC format) ☆66 · Aug 5, 2016 · Updated 9 years ago
- A SPARQL client for Amazon Neptune that includes AWS Signature Version 4 signing. Implemented as an RDF4J repository. ☆23 · May 28, 2026 · Updated last month
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆226 · Dec 22, 2022 · Updated 3 years ago
- Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends ☆57 · Jan 28, 2024 · Updated 2 years ago
- ☆20 · Jan 19, 2019 · Updated 7 years ago
- Ready-to-use examples of dkpro-core components and pipelines. ☆34 · Dec 16, 2023 · Updated 2 years ago
- CIKM Cup 2016 (1st Place) - Track 1 - Cross Device Entity Linking ☆18 · Sep 19, 2017 · Updated 8 years ago
- Virtual patent marking crawler at iproduct.epfl.ch ☆15 · Sep 13, 2017 · Updated 8 years ago
- Training materials: HADOOP, designing Big Data solutions using Apache Hadoop & Family ☆17 · Oct 18, 2022 · Updated 3 years ago
- Pure JAX-RS 2.0 ClientRequestFilter/WriterInterceptor used to sign AWS REST requests. Also has presign capabilities. ☆15 · Jan 4, 2022 · Updated 4 years ago
- ☆33 · Nov 14, 2013 · Updated 12 years ago
- [FFCV-PL] manage fast data loading with ffcv and pytorch lightning ☆16 · Jul 17, 2023 · Updated 2 years ago
- Write Like Hemingway ☆12 · Nov 28, 2014 · Updated 11 years ago
- Add interfaces to classes generated by other plugins ☆16 · Jul 1, 2023 · Updated 3 years ago
- ☆14 · Mar 19, 2025 · Updated last year
- Two simple cheatsheets about graphs in SPARQL ☆15 · May 29, 2020 · Updated 6 years ago
- ☆15 · Aug 15, 2012 · Updated 13 years ago
- ☆14 · Jun 13, 2024 · Updated 2 years ago
- Stupid Experiments in Elasticsearch Image Search ☆14 · Oct 18, 2019 · Updated 6 years ago
- TWS Market Data Adapter ☆20 · May 10, 2018 · Updated 8 years ago
- Terraform provider to create NetAPP OCCM instances, CVO resources, volumes, snapshots, ... in Azure, AWS, GCP. ☆20 · Jun 3, 2026 · Updated last month
- This repository contains sample code for ML on graph use cases using Amazon Neptune ML ☆14 · Dec 14, 2021 · Updated 4 years ago