commoncrawl/cc-crawl-statistics

commoncrawl / cc-crawl-statistics

Statistics of Common Crawl monthly archives mined from URL index files

☆226

Alternatives and similar repositories for cc-crawl-statistics

Users who are interested in cc-crawl-statistics are comparing it to the repositories listed below.

commoncrawl / cc-notebooks
Various Jupyter notebooks about Common Crawl data
☆66 • Jul 3, 2026 • Updated 3 weeks ago
commoncrawl / cc-webgraph
Tools to construct and process Common Crawl webgraphs
☆111 • Updated this week
commoncrawl / cc-index-table
Index Common Crawl archives in tabular format
☆132 • Updated this week
commoncrawl / cc-pyspark
Process Common Crawl data with Python and Spark
☆457 • Mar 26, 2026 • Updated 3 months ago
tballison / file-observatory
Single server/laptop grade file-observatory
☆10 • Mar 30, 2023 • Updated 3 years ago
oscar-project / ungoliant
The pipeline for the OSCAR corpus
☆178 • Nov 9, 2025 • Updated 8 months ago
UAlbanyArchives / describingWebArchives
Automating description for Web Archives in ArchivesSpace using the Archive-It CDX and Partner Data APIs
☆11 • Aug 10, 2018 • Updated 7 years ago
commoncrawl / news-crawl
News crawling with StormCrawler - stores content as WARC
☆375 • Updated this week
mnm-team / latex-beamer
Latex Beamer Theme
☆18 • Apr 25, 2025 • Updated last year
rom1504 / cc2dataset
Easily convert common crawl to a dataset of caption and document. Image/text Audio/text Video/text, ...
☆321 • Dec 9, 2023 • Updated 2 years ago
dfxml-working-group / dfxml_python
Digital Forensics XML packages in Python
☆18 • May 8, 2026 • Updated 2 months ago
ikreymer / cdx-index-client
A command-line tool for using CommonCrawl Index API at http://index.commoncrawl.org/
☆203 • Oct 7, 2018 • Updated 7 years ago
commoncrawl / web-languages
Crowd-sourced lists of urls to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ …
☆71 • Jul 1, 2026 • Updated 3 weeks ago
CI-Research / KeywordAnalysis
Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends
☆57 • Jan 28, 2024 • Updated 2 years ago
commoncrawl / cc-warc-examples
CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop
☆38 • Jun 30, 2026 • Updated 3 weeks ago
commoncrawl / whirlwind-python
A whirlwind tour of Common Crawl's data using Python
☆45 • Jun 15, 2026 • Updated last month
lucanag / emotet
☆10 • Sep 11, 2021 • Updated 4 years ago
cisnlp / GlotWeb
[WWW 2026] 🕸 GlotWeb: Web Indexing for Minority Languages
☆17 • Apr 14, 2026 • Updated 3 months ago
akanshajainn / K-means-Clustering-on-Text-Documents
Using Scikit-learn, machine learning library for the Python programming language.
☆14 • Apr 5, 2018 • Updated 8 years ago
commoncrawl / cc-index-server
Common Crawl Index Server
☆71 • Feb 28, 2025 • Updated last year
philschmid / deep-learning-remote-runner
☆16 • Aug 10, 2022 • Updated 3 years ago
rossf7 / elasticrawl
Launch AWS Elastic MapReduce jobs that process Common Crawl data.
☆49 • Feb 15, 2017 • Updated 9 years ago
chatnoir-eu / chatnoir-resiliparse
A robust web archive analytics toolkit
☆144 • Updated this week
kermitt2 / biblio-glutton-extension
A browser extension providing Open Access bibliographical services
☆18 • Dec 9, 2022 • Updated 3 years ago
transducens / linguacrawl
Crawling engine that crawls a set of top-level domains looking for documents in a list of languages
☆11 • Feb 6, 2024 • Updated 2 years ago
radareorg / radare2-skel
Sample radare2 project templates
☆16 • Jan 23, 2026 • Updated 6 months ago
commoncrawl / nutch
Common Crawl fork of Apache Nutch
☆42 • Updated this week
huggingface / olm-datasets
Pipeline for pulling and processing online language model pretraining data from the web
☆179 • Jul 31, 2023 • Updated 2 years ago
Zyphra / Zyda_processing
☆44 • Jun 19, 2024 • Updated 2 years ago
newsreader / eso-and-ceo
Events and Situations Ontology
☆14 • Apr 20, 2018 • Updated 8 years ago
webrecorder / pywb
Core Python Web Archiving Toolkit for replay and recording of web archives
☆1,683 • Apr 10, 2026 • Updated 3 months ago
commoncrawl / ia-web-commons
Web archiving utility library
☆11 • Updated this week
noanabeshima / wikipedia-downloader
Downloads 2020 English Wikipedia articles as plaintext
☆27 • Mar 25, 2023 • Updated 3 years ago
RockefellerArchiveCenter / dm_log
Inventory digital media items and log disk imaging
☆12 • Jun 10, 2022 • Updated 4 years ago
OpenHands / agent-analysis
A collection of scripts and tools for analyzing SWE agents.
☆16 • May 7, 2025 • Updated last year
openeventdata / phoenix_pipeline
Turning news into events since 2014.
☆52 • May 1, 2017 • Updated 9 years ago
drewgendreau / Socrata.jl
Julia API for accessing Socrata open data sets
☆15 • Jul 25, 2014 • Updated 11 years ago
csirt-tooling-org / csirt-tooling-best-practices
CSIRT Tooling: Best Practices in Developing, Maintaining and Distributing Open Source Tools
☆16 • Feb 26, 2026 • Updated 4 months ago
darrow-labs / LegalLens
☆10 • Jul 15, 2024 • Updated 2 years ago