commoncrawl/cc-pyspark

commoncrawl / cc-pyspark

Process Common Crawl data with Python and Spark

☆457

Alternatives and similar repositories for cc-pyspark

Users who are interested in cc-pyspark are comparing it to the libraries listed below.

commoncrawl / cc-notebooks
Various Jupyter notebooks about Common Crawl data
☆66 · Jul 3, 2026 · Updated 3 weeks ago
commoncrawl / news-crawl
News crawling with StormCrawler - stores content as WARC
☆375 · Updated this week
commoncrawl / cdx_toolkit
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
☆208 · Jun 24, 2026 · Updated last month
webrecorder / warcio
Streaming WARC/ARC library for fast web archive IO
☆462 · Jun 10, 2026 · Updated last month
commoncrawl / cc-webgraph
Tools to construct and process Common Crawl webgraphs
☆111 · Updated this week
commoncrawl / cc-crawl-statistics
Statistics of Common Crawl monthly archives mined from URL index files
☆227 · Updated this week
qburst / common-crawl-malayalam
Useful tools to extract Malayalam text from the Common Crawl Datasets
☆28 · Jul 10, 2026 · Updated 2 weeks ago
ikreymer / cdx-index-client
A command-line tool for using CommonCrawl Index API at http://index.commoncrawl.org/
☆203 · Oct 7, 2018 · Updated 7 years ago
facebookresearch / cc_net
Tools to download and cleanup Common Crawl data
☆1,047 · Apr 25, 2023 · Updated 3 years ago
ilinguistics / common_crawl_corpus
Scripts for building a geo-located web corpus using Common Crawl data
☆11 · Jan 18, 2026 · Updated 6 months ago
commoncrawl / cc-citations
Scientific articles using or citing Common Crawl data
☆29 · Jul 8, 2026 · Updated 2 weeks ago
trendsci / linkrun
LinkRun - Data Engineering project done in 3 weeks during the Insight fellowship
☆38 · Apr 2, 2020 · Updated 6 years ago
iipc / webarchive-commons
Common web archive utility code.
☆65 · Jul 3, 2026 · Updated 3 weeks ago
CI-Research / KeywordAnalysis
Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends
☆57 · Jan 28, 2024 · Updated 2 years ago
shjwudp / c4-dataset-script
Inspired by google c4, here is a series of colossal clean data cleaning scripts focused on CommonCrawl data processing. Including Chinese…
☆136 · Jun 7, 2023 · Updated 3 years ago
jpbruinsslot / warc3
Python 3 library for reading and writing warc files
☆21 · Jan 29, 2018 · Updated 8 years ago
oscar-project / goclassy
An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline.
☆86 · Apr 21, 2021 · Updated 5 years ago
dkpro / dkpro-c4corpus
DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate…
☆53 · Jun 12, 2020 · Updated 6 years ago
rossf7 / elasticrawl
Launch AWS Elastic MapReduce jobs that process Common Crawl data.
☆49 · Feb 15, 2017 · Updated 9 years ago
ikreymer / webarchive-indexing
Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.
☆46 · Dec 4, 2017 · Updated 8 years ago
webrecorder / pywb
Core Python Web Archiving Toolkit for replay and recording of web archives
☆1,684 · Apr 10, 2026 · Updated 3 months ago
internetarchive / surt
Sort-friendly URI Reordering Transform (SURT) python module
☆45 · Sep 11, 2025 · Updated 10 months ago
dylanzenner / business_closures_de_pipeline
Data Engineering pipeline hosted entirely in the AWS ecosystem utilizing DocumentDB as the database
☆14 · Oct 26, 2021 · Updated 4 years ago
commoncrawl / nutch
Common Crawl fork of Apache Nutch
☆42 · Jul 18, 2026 · Updated last week
commoncrawl / gzipstream
gzipstream allows Python to process multi-part gzip files from a streaming source
☆23 · Feb 24, 2017 · Updated 9 years ago
mapio / py-web-graph
A simple package allowing to use WebGraph data in Python (via the Jython interpreter).
☆20 · Oct 21, 2020 · Updated 5 years ago
ukwa / webarchive-discovery
Please note that the warc-indexer tool & code is now supported by NetArchiveSuite. The 'warc-indexer' directory and code that exists in t…
☆133 · Nov 21, 2025 · Updated 8 months ago
DocNow / waybackprov
utility to fetch provenance information from Internet Archive's Wayback Machine
☆15 · Feb 5, 2026 · Updated 5 months ago
fhamborg / news-please
news-please - an integrated web crawler and information extractor for news that just works
☆2,472 · Apr 14, 2026 · Updated 3 months ago
oscar-project / ungoliant
The pipeline for the OSCAR corpus
☆178 · Nov 9, 2025 · Updated 8 months ago
rom1504 / cc2dataset
Easily convert common crawl to a dataset of caption and document. Image/text Audio/text Video/text, ...
☆321 · Dec 9, 2023 · Updated 2 years ago
helgeho / ArchiveSpark
An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed…
☆161 · Oct 8, 2025 · Updated 9 months ago
Damian89 / commonCrawlParser
Simple multi threaded tool to extract domain related data from commoncrawl.org
☆31 · Jul 17, 2018 · Updated 8 years ago
bigscience-workshop / data-preparation
Code used for sourcing and cleaning the BigScience ROOTS corpus
☆318 · Mar 20, 2023 · Updated 3 years ago
UAlbanyArchives / describingWebArchives
Automating description for Web Archives in ArchivesSpace using the Archive-It CDX and Partner Data APIs
☆11 · Aug 10, 2018 · Updated 7 years ago
ChenghaoMou / text-dedup
All-in-one text de-duplication
☆764 · Mar 9, 2026 · Updated 4 months ago
tatsu-lab / mlm_inductive_bias
Code Release for "On the Inductive Bias of Masked Language Modeling: From Statistical to Syntactic Dependencies"
☆16 · Apr 13, 2021 · Updated 5 years ago
internetarchive / warc
Python library for reading and writing warc files
☆249 · Mar 7, 2022 · Updated 4 years ago
rushter / selectolax
Python binding to Modest and Lexbor engines. Fast HTML5 parser with CSS selectors for Python.
☆1,658 · Jul 15, 2026 · Updated last week