commoncrawl/cc-downloader

A polite and user-friendly downloader for Common Crawl data

☆86

Alternatives and similar repositories for cc-downloader

Users who are interested in cc-downloader are comparing it to the repositories listed below.

thunderpoot / scdx
A simple tool for querying the Common Crawl CDX
☆16 · Jan 10, 2026 · Updated 6 months ago
commoncrawl / cc-citations
Scientific articles using or citing Common Crawl data
☆29 · Jul 8, 2026 · Updated 2 weeks ago
nlnwa / gowarcserver
☆17 · Mar 31, 2025 · Updated last year
bitextor / warc2text
Extracts plain text, language identification and more metadata from WARC records
☆23 · Apr 16, 2026 · Updated 3 months ago
anjackson / sliver
A tool for collecting archival slivers of the web and web archives
☆19 · Jun 1, 2026 · Updated last month
hopsparser / hopsparser
A neural dependency parser that does its best
☆17 · Mar 6, 2026 · Updated 4 months ago
iipc / warc2html
Converts WARC files to static HTML
☆59 · Sep 18, 2025 · Updated 10 months ago
mediacloud / metadata-lib
How Media Cloud approaches extracting metadata from online news stories
☆17 · Apr 15, 2026 · Updated 3 months ago
peshkira / c3po
Clever, Crafty Content Profiling of Objects
☆20 · Jan 6, 2022 · Updated 4 years ago
vphill / web-archiving-course
Web Archiving Course
☆23 · Mar 4, 2024 · Updated 2 years ago
chatnoir-eu / chatnoir-resiliparse
A robust web archive analytics toolkit
☆144 · Updated this week
kermitt2 / arxiv_harvester
Poor man's simple harvester for arXiv resources
☆14 · Jul 14, 2023 · Updated 3 years ago
harvard-lil / waczerciser
Create and edit WARC and WACZ files
☆29 · Dec 6, 2024 · Updated last year
muffinista / wayback_exe
Code for the Twitter bot @wayback_exe
☆49 · Sep 24, 2025 · Updated 9 months ago
commoncrawl / whirlwind-python
A whirlwind tour of Common Crawl's data using Python
☆45 · Jun 15, 2026 · Updated last month
kermitt2 / biblio-glutton-extension
A browser extension providing Open Access bibliographical services
☆18 · Dec 9, 2022 · Updated 3 years ago
archivesunleashed / auk
Rails application for the Archives Unleashed Cloud.
☆11 · Jun 30, 2021 · Updated 5 years ago
The-AI-Alliance / open-trusted-data-initiative
Working repo to support the Alliance's Open Trusted Data Initiative
☆15 · Jul 1, 2026 · Updated 2 weeks ago
edsu / memento-cli
A command line utility for listing and searching snapshots in web archives
☆20 · Jun 4, 2026 · Updated last month
jedireza / warc
A Rust library for reading and writing WARC files
☆60 · Nov 27, 2024 · Updated last year
DocNow / awesome-social-media-archiving
Tools for helping you work with web platform archive downloads.
☆18 · Mar 27, 2020 · Updated 6 years ago
laurieburchell / open-lid-dataset
Repository accompanying "An Open Dataset and Model for Language Identification" (Burchell et al., 2023)
☆77 · Apr 1, 2025 · Updated last year
web-archive-group / hackathon
☆14 · Feb 28, 2017 · Updated 9 years ago
kpu / fasterText
Library for fast text representation and classification.
☆31 · Jan 9, 2024 · Updated 2 years ago
iipc / jwarc
Java library for reading and writing WARC files with a typed API
☆60 · Jun 27, 2026 · Updated 3 weeks ago
recursal / minmodmon
Mini Model Daemon
☆13 · Nov 9, 2024 · Updated last year
maturban / WARCMerge
Merging WARCs into a single WARC file
☆15 · Aug 29, 2014 · Updated 11 years ago
ikreymer / webarchive-indexing
Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.
☆46 · Dec 4, 2017 · Updated 8 years ago
webrecorder / public-web-archives
A listing of world wide web archives, for humans and machines using Web Archive Manifest (WAM) yaml format
☆55 · Dec 5, 2022 · Updated 3 years ago
bethelmelesse / UnifiedCrawl
☆17 · Nov 26, 2024 · Updated last year
allenai / decon
decontamination
☆35 · Mar 4, 2026 · Updated 4 months ago
JohnMarkOckerbloom / onlinebooks
Selected code and data for The Online Books Page and related applications
☆12 · Jul 1, 2026 · Updated 2 weeks ago
rajbot / CDX-Writer
Python script to create CDX index files of WARC data
☆16 · Sep 7, 2018 · Updated 7 years ago
adulau / napkin-text-analysis
Napkin is a simple tool to produce statistical analysis of a text
☆12 · Feb 25, 2024 · Updated 2 years ago
commoncrawl / cdx_toolkit
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
☆208 · Jun 24, 2026 · Updated 3 weeks ago
peterk / munin-indexer
A social media open post web archiving tool
☆26 · Feb 4, 2026 · Updated 5 months ago
SYSTRAN / similarity
Bilingual sentence similarity classifier using Tensorflow
☆24 · Sep 26, 2019 · Updated 6 years ago
chryzsh / GPTCommentDetector
A UserScript to detect GPT generated comments on Hackernews.
☆13 · Dec 10, 2022 · Updated 3 years ago
acidvegas / czds
ICANN Centralized Zone Data Service (CZDS) Tool
☆21 · Mar 26, 2025 · Updated last year