CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop
☆38Mar 12, 2026Updated last month
Alternatives and similar repositories for cc-warc-examples
Users that are interested in cc-warc-examples are comparing it to the libraries listed below.
- Launch AWS Elastic MapReduce jobs that process Common Crawl data.☆49Feb 15, 2017Updated 9 years ago
- Common web archive utility code.☆63Apr 1, 2026Updated last month
- DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate…☆52Jun 12, 2020Updated 5 years ago
- Demonstrations of markdown presentation features to the GitPitch community.☆10Jul 24, 2019Updated 6 years ago
- Set of scripts to aid in the download of the GDELT data files from www.gdeltproject.org☆12May 17, 2014Updated 11 years ago
- Demonstration of using Python to process the Common Crawl dataset with the mrjob framework☆168Jan 27, 2026Updated 3 months ago
- Hadoop jobs for WikiReverse project. Parses Common Crawl data for links to Wikipedia articles.☆38Aug 12, 2018Updated 7 years ago
- FPGA code for NeTV2☆16Dec 3, 2018Updated 7 years ago
- Discover how you can migrate from traditional deployments to serverless architectures with AWS☆12Feb 1, 2019Updated 7 years ago
- ☆15Dec 1, 2021Updated 4 years ago
- Code for the paper Faster Phrase-Based Decoding by Refining Feature State☆14Jan 9, 2023Updated 3 years ago
- AWS Lambda precompiled binaries for lxml 3.6.4 built for Python 2.7 and Python 3.6 runtimes☆12Apr 17, 2019Updated 7 years ago
- PredictionIO word2vec engine template (Scala-based parallelized engine)☆12Apr 22, 2015Updated 11 years ago
- XPath extension for extraction from interactive web sites. NOTE: This code is currently out of sync. A more recent, but precompiled versi…☆27Feb 27, 2013Updated 13 years ago
- ☆15Updated this week
- This repository contains a series of 4 jupyter notebooks demonstrating how AWS AI Services like Amazon Rekognition, Amazon Transcribe and…☆12Nov 26, 2021Updated 4 years ago
- A library of examples showing how to use the Common Crawl corpus (2008-2012, ARC format)☆65Aug 5, 2016Updated 9 years ago
- Set of scripts to aid in the download of the GDELT data files from gdelt.utdallas.edu☆18May 14, 2014Updated 11 years ago
- Efficient, distributed downloads of large files from S3 to HDFS using Spark.☆17Apr 26, 2017Updated 9 years ago
- A SPARQL client for Amazon Neptune that includes AWS Signature Version 4 signing. Implemented as an RDF4J repository.☆23Mar 2, 2026Updated 2 months ago
- The Dynamic Rules Engine is a serverless application that enables real-time evaluation of rules against sensor data, leveraging AWS Kines…☆11Sep 25, 2024Updated last year
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012)☆225Dec 22, 2022Updated 3 years ago
- Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends☆57Jan 28, 2024Updated 2 years ago
- ☆20Jan 19, 2019Updated 7 years ago
- All future small examples and projects from the itbackyard blog will be added to this repo☆10Jan 16, 2024Updated 2 years ago
- Ready-to-use examples of dkpro-core components and pipelines.☆35Dec 16, 2023Updated 2 years ago
- PETRARCH actor, agent and verb dictionaries☆22Aug 3, 2018Updated 7 years ago
- AI-powered YouTube video analysis toolkit using MCP. Extract transcripts, generate knowledge graphs, generate high-quality detailed note…☆14Jul 5, 2025Updated 9 months ago
- ReactJS frontend that interacts with the Bodhi backend services☆18Sep 3, 2019Updated 6 years ago
- ☆14Jul 1, 2025Updated 10 months ago
- Virtual patent marking crawler at iproduct.epfl.ch☆15Sep 13, 2017Updated 8 years ago
- Training materials: HADOOP, Designing Big Data Solutions with Apache Hadoop & Family☆17Oct 18, 2022Updated 3 years ago
- A whirlwind tour of Common Crawl's data using Python☆44Apr 13, 2026Updated 2 weeks ago
- Pure JAX-RS 2.0 ClientRequestFilter/WriterInterceptor used to sign AWS REST requests. Also has presign capabilities.☆15Jan 4, 2022Updated 4 years ago
- Write Like Hemingway☆12Nov 28, 2014Updated 11 years ago
- ☆23Jul 8, 2025Updated 9 months ago
- ☆14Jun 13, 2024Updated last year
- Stupid Experiments in Elasticsearch Image Search☆14Oct 18, 2019Updated 6 years ago
- REST API using: Spring Boot + Hibernate + MySQL + Jackson + Retrofit☆10Jan 22, 2016Updated 10 years ago