Statistics of Common Crawl monthly archives mined from URL index files
☆212 · Mar 18, 2026 · Updated this week
Alternatives and similar repositories for cc-crawl-statistics
Users that are interested in cc-crawl-statistics are comparing it to the libraries listed below.
- Index Common Crawl archives in tabular format · ☆126 · Updated this week
- Process Common Crawl data with Python and Spark · ☆453 · Jan 20, 2026 · Updated 2 months ago
- Common web archive utility code. · ☆63 · Mar 2, 2026 · Updated 3 weeks ago
- Scripts for building a geo-located web corpus using Common Crawl data · ☆11 · Jan 18, 2026 · Updated 2 months ago
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) · ☆224 · Dec 22, 2022 · Updated 3 years ago
- Voluntary recognitions of unions known to the NLRB · ☆12 · Nov 2, 2024 · Updated last year
- A command-line tool for using CommonCrawl Index API at http://index.commoncrawl.org/ · ☆205 · Oct 7, 2018 · Updated 7 years ago
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine · ☆201 · Jan 23, 2026 · Updated 2 months ago
- Demonstration of using Python to process the Common Crawl dataset with the mrjob framework · ☆168 · Jan 27, 2026 · Updated last month
- A polite and user-friendly downloader for Common Crawl data · ☆71 · Mar 3, 2026 · Updated 2 weeks ago
- A Tiny Linux Distro ~ 3MB just Tiny Kernel + Unix User Space · ☆18 · Jan 13, 2026 · Updated 2 months ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. · ☆47 · Dec 4, 2017 · Updated 8 years ago
- Deployment of pywb as a CommonCrawl Index Server · ☆21 · Oct 6, 2017 · Updated 8 years ago
- URS Benchmark: Evaluating LLMs on User Reported Scenarios · ☆30 · May 30, 2025 · Updated 9 months ago
- Load bioinformatics datasets into a local database · ☆11 · Apr 8, 2025 · Updated 11 months ago
- mist R package files · ☆10 · May 5, 2025 · Updated 10 months ago
- ☆14 · Jan 3, 2024 · Updated 2 years ago
- Corpus of domain names scraped from Common Crawl and manually annotated to add word boundaries (e.g. "commoncrawl" to "common crawl"). · ☆20 · Jun 16, 2025 · Updated 9 months ago
- ☆17 · Aug 9, 2025 · Updated 7 months ago
- Crowd-sourced lists of urls to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ … · ☆69 · Jan 7, 2026 · Updated 2 months ago
- Deploy and Scale LLM-based applications · ☆26 · Jun 15, 2023 · Updated 2 years ago
- This repository contains code for fine-tuning the Whisper speech-to-text model. · ☆23 · Updated this week
- [COLM 2025] An Open Math Pre-training Dataset with 370B Tokens. · ☆110 · Apr 4, 2025 · Updated 11 months ago
- A robust web archive analytics toolkit · ☆134 · Oct 15, 2025 · Updated 5 months ago
- 🕸 GlotCC Dataset and Pipeline -- NeurIPS 2024 · ☆20 · Apr 6, 2025 · Updated 11 months ago
- Network exploit detection using highly accurate pre-trained deep neural networks with Celery + Keras + Tensorflow + Redis · ☆22 · Dec 7, 2018 · Updated 7 years ago
- A browser extension providing Open Access bibliographical services · ☆18 · Dec 9, 2022 · Updated 3 years ago
- DEPRECATED. Replaced with Electron desktop application: https://github.com/bulk-reviewer/bulk-reviewer · ☆13 · Apr 16, 2019 · Updated 6 years ago
- Electronic records accessioning workflow · ☆10 · Apr 24, 2017 · Updated 8 years ago
- ☆10 · Sep 11, 2021 · Updated 4 years ago
- Set of scripts to aid in the download of the GDELT data files from www.gdeltproject.org · ☆12 · May 17, 2014 · Updated 11 years ago
- Overview of corpora/datasets for Germanic low-resource languages and dialects. Accompanies "A Survey of Corpora for Germanic Low-Resource… · ☆27 · Feb 16, 2026 · Updated last month
- Blog post · ☆17 · Feb 16, 2024 · Updated 2 years ago
- A set of reusable Java components that implement functionality common to any web crawler · ☆254 · Feb 26, 2026 · Updated 3 weeks ago
- Simple multi-threaded tool to extract domain related data from commoncrawl.org · ☆31 · Jul 17, 2018 · Updated 7 years ago
- Code for the EMNLP2020 long paper "Lifelong Language Knowledge Distillation" https://arxiv.org/abs/2010.02123 · ☆12 · Jul 13, 2021 · Updated 4 years ago
- Evaluating Reward Models in Multilingual Settings (ACL Main '25) · ☆41 · May 16, 2025 · Updated 10 months ago
- An alternative approach for probabilistic topic modeling based on agglomerative clustering of topics (not documents) · ☆12 · Apr 14, 2021 · Updated 4 years ago
- A whirlwind tour of Common Crawl's data using Python · ☆37 · Feb 17, 2026 · Updated last month