CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop
☆38Mar 12, 2026Updated 3 months ago
Alternatives and similar repositories for cc-warc-examples
Users who are interested in cc-warc-examples are comparing it to the libraries listed below.
- Launch AWS Elastic MapReduce jobs that process Common Crawl data.☆49Feb 15, 2017Updated 9 years ago
- Role-based access control☆14Sep 24, 2021Updated 4 years ago
- Demonstrations of markdown presentation features to the GitPitch community.☆10Jul 24, 2019Updated 6 years ago
- Set of scripts to aid in the download of the GDELT data files from www.gdeltproject.org☆12May 17, 2014Updated 12 years ago
- Hadoop jobs for WikiReverse project. Parses Common Crawl data for links to Wikipedia articles.☆37Aug 12, 2018Updated 7 years ago
- Rainfall is an extensible Java framework for implementing custom DSL-based stress and performance tests☆12Mar 31, 2026Updated 2 months ago
- Code for the paper Faster Phrase-Based Decoding by Refining Feature State☆14Jan 9, 2023Updated 3 years ago
- Java Alerting Framework for ElasticSearch☆12May 20, 2016Updated 10 years ago
- Several scripts to analyse Wikidata dumps☆33Apr 7, 2014Updated 12 years ago
- AWS Lambda precompiled binaries for lxml 3.6.4 built for Python 2.7 and Python 3.6 runtimes☆12Apr 17, 2019Updated 7 years ago
- Index Common Crawl archives in tabular format☆128Jun 4, 2026Updated last week
- A simple Node.js wrapper for the BitX API.☆11Jun 23, 2022Updated 3 years ago
- PredictionIO word2vec engine template (Scala-based parallelized engine)☆12Apr 22, 2015Updated 11 years ago
- PredictionIO Node SDK☆63Aug 13, 2014Updated 11 years ago
- XPath extension for extraction from interactive web sites. NOTE: This code is currently out of sync. A more recent, but precompiled versi…☆25Feb 27, 2013Updated 13 years ago
- Cross-platform dock icon implementation☆28Mar 6, 2014Updated 12 years ago
- ☆15May 20, 2026Updated 3 weeks ago
- A library of examples showing how to use the Common Crawl corpus (2008-2012, ARC format)☆66Aug 5, 2016Updated 9 years ago
- Set of scripts to aid in the download of the GDELT data files from gdelt.utdallas.edu☆18May 14, 2014Updated 12 years ago
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012)☆224Dec 22, 2022Updated 3 years ago
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop☆56Apr 26, 2021Updated 5 years ago
- ☆20Jan 19, 2019Updated 7 years ago
- PETRARCH actor, agent and verb dictionaries☆22Aug 3, 2018Updated 7 years ago
- JUnitBenchmarks (git clone of the head SVN)☆58Apr 13, 2015Updated 11 years ago
- ☆15Jul 1, 2025Updated 11 months ago
- Virtual patent marking crawler at iproduct.epfl.ch☆15Sep 13, 2017Updated 8 years ago
- Lambda Function to extract EXIF data from images uploaded to an S3 bucket and store it in DynamoDB.☆15Aug 17, 2018Updated 7 years ago
- Training materials: HADOOP, designing Big Data solutions with Apache Hadoop & Family☆17Oct 18, 2022Updated 3 years ago
- A whirlwind tour of Common Crawl's data using Python☆45Apr 13, 2026Updated 2 months ago
- Kylo integration with PDND (previously DAF).☆19Nov 16, 2022Updated 3 years ago
- ☆33Nov 14, 2013Updated 12 years ago
- ☆14Feb 22, 2015Updated 11 years ago
- Windows 32/64-bit Include files and Import Libraries☆16May 26, 2022Updated 4 years ago
- Write Like Hemingway☆12Nov 28, 2014Updated 11 years ago
- Add interfaces to classes generated by other plugins☆16Jul 1, 2023Updated 2 years ago
- Two simple cheatsheets about graphs in SPARQL☆15May 29, 2020Updated 6 years ago
- Process Common Crawl data with Python and Spark☆455Mar 26, 2026Updated 2 months ago
- ☆15Aug 15, 2012Updated 13 years ago
- ☆23Jul 8, 2025Updated 11 months ago