Statistics of Common Crawl monthly archives mined from URL index files
☆215Apr 7, 2026Updated last week
Alternatives and similar repositories for cc-crawl-statistics
Users that are interested in cc-crawl-statistics are comparing it to the libraries listed below.
- Various Jupyter notebooks about Common Crawl data☆65Nov 22, 2025Updated 4 months ago
- Tools to construct and process Common Crawl webgraphs☆107Mar 26, 2026Updated 2 weeks ago
- Index Common Crawl archives in tabular format☆127Mar 20, 2026Updated 3 weeks ago
- Process Common Crawl data with Python and Spark☆453Mar 26, 2026Updated 2 weeks ago
- ☆25Mar 20, 2024Updated 2 years ago
- Single server/laptop grade file-observatory☆10Mar 30, 2023Updated 3 years ago
- Scripts for building a geo-located web corpus using Common Crawl data☆11Jan 18, 2026Updated 2 months ago
- Easily convert common crawl to a dataset of caption and document. Image/text Audio/text Video/text, ...☆322Dec 9, 2023Updated 2 years ago
- Voluntary recognitions of unions known to the NLRB☆12Nov 2, 2024Updated last year
- Gathers urls from common crawl☆34Nov 9, 2019Updated 6 years ago
- A command-line tool for using CommonCrawl Index API at http://index.commoncrawl.org/☆204Oct 7, 2018Updated 7 years ago
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine☆201Mar 23, 2026Updated 3 weeks ago
- Demonstration of using Python to process the Common Crawl dataset with the mrjob framework☆168Jan 27, 2026Updated 2 months ago
- 🕸 GlotWeb: Web Indexing for Minority Languages (WWW 2026)☆17Feb 27, 2026Updated last month
- Deployment of pywb as a CommonCrawl Index Server☆21Oct 6, 2017Updated 8 years ago
- URS Benchmark: Evaluating LLMs on User Reported Scenarios☆30May 30, 2025Updated 10 months ago
- Scientific articles using or citing Common Crawl data☆28Mar 19, 2026Updated 3 weeks ago
- implementation of dualformer☆25Mar 1, 2025Updated last year
- Common Crawl Index Server☆71Feb 28, 2025Updated last year
- Corpus of domain names scraped from Common Crawl and manually annotated to add word boundaries (e.g. "commoncrawl" to "common crawl").☆20Jun 16, 2025Updated 9 months ago
- ☆16Aug 10, 2022Updated 3 years ago
- Launch AWS Elastic MapReduce jobs that process Common Crawl data.☆49Feb 15, 2017Updated 9 years ago
- Crowd-sourced lists of urls to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ …☆69Jan 7, 2026Updated 3 months ago
- [COLM 2025] An Open Math Pre-training Dataset with 370B Tokens.☆110Apr 4, 2025Updated last year
- A robust web archive analytics toolkit☆136Updated this week
- 🕸 GlotCC Dataset and Pipeline -- NeurIPS 2024☆20Apr 6, 2025Updated last year
- Rust bindings to libpostal☆14Mar 28, 2022Updated 4 years ago
- A browser extension providing Open Access bibliographical services☆18Dec 9, 2022Updated 3 years ago
- DEPRECATED. Replaced with Electron desktop application: https://github.com/bulk-reviewer/bulk-reviewer☆13Apr 16, 2019Updated 6 years ago
- Official InfiniBench: A Benchmark for Large Multi-Modal Models in Long-Form Movies and TV Shows☆19Nov 4, 2025Updated 5 months ago
- ☆10Sep 11, 2021Updated 4 years ago
- A persistent repository for PRONOM Research Week activities☆12May 26, 2021Updated 4 years ago
- Overview of corpora/datasets for Germanic low-resource languages and dialects. Accompanies "A Survey of Corpora for Germanic Low-Resource…☆27Feb 16, 2026Updated last month
- A set of reusable Java components that implement functionality common to any web crawler☆255Feb 26, 2026Updated last month
- Wrapper around hfsutils to generate DFXML for HFS-formatted disk images☆11Apr 20, 2018Updated 7 years ago
- [ICCV 2025] Dynamic-VLM☆28Dec 16, 2024Updated last year
- A whirlwind tour of Common Crawl's data using Python☆38Apr 1, 2026Updated last week
- A collection of scripts and tools for analyzing SWE agents.☆16May 7, 2025Updated 11 months ago
- Core Python Web Archiving Toolkit for replay and recording of web archives☆1,640Apr 7, 2026Updated last week