Statistics of Common Crawl monthly archives mined from URL index files
☆219 · Apr 27, 2026 · Updated last week
Alternatives and similar repositories for cc-crawl-statistics
Users who are interested in cc-crawl-statistics are comparing it to the repositories listed below.
- Various Jupyter notebooks about Common Crawl data ☆66 · Nov 22, 2025 · Updated 5 months ago
- Tools to construct and process Common Crawl webgraphs ☆108 · Updated this week
- Index Common Crawl archives in tabular format ☆128 · Updated this week
- Process Common Crawl data with Python and Spark ☆454 · Mar 26, 2026 · Updated last month
- ☆25 · Mar 20, 2024 · Updated 2 years ago
- The pipeline for the OSCAR corpus ☆177 · Nov 9, 2025 · Updated 5 months ago
- Scripts for building a geo-located web corpus using Common Crawl data ☆11 · Jan 18, 2026 · Updated 3 months ago
- Automating description for Web Archives in ArchivesSpace using the Archive-It CDX and Partner Data APIs ☆11 · Aug 10, 2018 · Updated 7 years ago
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆225 · Dec 22, 2022 · Updated 3 years ago
- Latex Beamer Theme ☆16 · Apr 25, 2025 · Updated last year
- Easily convert common crawl to a dataset of caption and document. Image/text Audio/text Video/text, ... ☆321 · Dec 9, 2023 · Updated 2 years ago
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine ☆204 · Mar 23, 2026 · Updated last month
- Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends ☆57 · Jan 28, 2024 · Updated 2 years ago
- [WWW 2026] 🕸 GlotWeb: Web Indexing for Minority Languages ☆17 · Apr 14, 2026 · Updated 3 weeks ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆47 · Dec 4, 2017 · Updated 8 years ago
- A polite and user-friendly downloader for Common Crawl data ☆79 · Apr 24, 2026 · Updated last week
- Digital Forensics XML packages in Python ☆18 · Jan 20, 2026 · Updated 3 months ago
- URS Benchmark: Evaluating LLMs on User Reported Scenarios ☆31 · May 30, 2025 · Updated 11 months ago
- ☆14 · Jan 3, 2024 · Updated 2 years ago
- ☆13 · Nov 28, 2025 · Updated 5 months ago
- Implementation of our Delaunay based rough/multi-stroke sketches simplification work ☆12 · Jan 18, 2020 · Updated 6 years ago
- Corpus of domain names scraped from Common Crawl and manually annotated to add word boundaries (e.g. "commoncrawl" to "common crawl"). ☆20 · Jun 16, 2025 · Updated 10 months ago
- ☆17 · Aug 9, 2025 · Updated 8 months ago
- Deploy and Scale LLM-based applications ☆26 · Jun 15, 2023 · Updated 2 years ago
- A robust web archive analytics toolkit ☆137 · Apr 28, 2026 · Updated last week
- Crawling engine that crawls a set of top-level domains looking for documents in a list of languages ☆11 · Feb 6, 2024 · Updated 2 years ago
- 💬 A curated list of incredible amount of publications related to Dialogue Systems especially Chatbots and Chit-chat Systems ☆10 · Dec 5, 2019 · Updated 6 years ago
- Rust bindings to libpostal ☆14 · Mar 28, 2022 · Updated 4 years ago
- Electronic records accessioning workflow ☆10 · Apr 24, 2017 · Updated 9 years ago
- Official InfiniBench: A Benchmark for Large Multi-Modal Models in Long-Form Movies and TV Shows ☆20 · Nov 4, 2025 · Updated 6 months ago
- No pain HTML parsing library. ☆12 · Apr 2, 2018 · Updated 8 years ago
- Blog post ☆17 · Feb 16, 2024 · Updated 2 years ago
- A persistent repository for PRONOM Research Week activities ☆12 · May 26, 2021 · Updated 4 years ago
- Overview of corpora/datasets for Germanic low-resource languages and dialects. Accompanies "A Survey of Corpora for Germanic Low-Resource… ☆27 · Feb 16, 2026 · Updated 2 months ago
- A set of reusable Java components that implement functionality common to any web crawler ☆256 · Apr 27, 2026 · Updated last week
- Wrapper around hfsutils to generate DFXML for HFS-formatted disk images ☆11 · Apr 20, 2018 · Updated 8 years ago
- [ICCV 2025] Dynamic-VLM ☆28 · Dec 16, 2024 · Updated last year
- Common Crawl fork of Apache Nutch ☆41 · Apr 20, 2026 · Updated 2 weeks ago
- ☆16 · Apr 12, 2024 · Updated 2 years ago