Process Common Crawl data with Python and Spark
☆453Jan 20, 2026Updated 2 months ago
Alternatives and similar repositories for cc-pyspark
Users that are interested in cc-pyspark are comparing it to the libraries listed below.
- Index Common Crawl archives in tabular format☆126Mar 20, 2026Updated last week
- Various Jupyter notebooks about Common Crawl data☆64Nov 22, 2025Updated 4 months ago
- Demonstration of using Python to process the Common Crawl dataset with the mrjob framework☆168Jan 27, 2026Updated 2 months ago
- News crawling with StormCrawler - stores content as WARC☆364Feb 19, 2025Updated last year
- Tools to construct and process Common Crawl webgraphs☆105Mar 20, 2026Updated last week
- Streaming WARC/ARC library for fast web archive IO☆452Updated this week
- Statistics of Common Crawl monthly archives mined from URL index files☆212Mar 19, 2026Updated last week
- Useful tools to extract Malayalam text from the Common Crawl datasets☆28Dec 11, 2024Updated last year
- A command-line tool for using CommonCrawl Index API at http://index.commoncrawl.org/☆205Oct 7, 2018Updated 7 years ago
- Common web archive utility code.☆63Mar 2, 2026Updated 3 weeks ago
- Tools to download and cleanup Common Crawl data☆1,038Apr 25, 2023Updated 2 years ago
- A robust web archive analytics toolkit☆134Oct 15, 2025Updated 5 months ago
- Inspired by Google C4, a series of data cleaning scripts focused on CommonCrawl data processing. Including Chinese…☆135Jun 7, 2023Updated 2 years ago
- Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends☆57Jan 28, 2024Updated 2 years ago
- Scripts for building a geo-located web corpus using Common Crawl data☆11Jan 18, 2026Updated 2 months ago
- Index URLs in Common Crawl☆197Sep 19, 2017Updated 8 years ago
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop☆38Mar 12, 2026Updated 2 weeks ago
- super-Django-CC is a simple web interface for commoncrawl.org☆15Dec 8, 2022Updated 3 years ago
- Exploring Common-Crawl using Python and DynamoDB☆33Oct 26, 2017Updated 8 years ago
- DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate…☆52Jun 12, 2020Updated 5 years ago
- An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline.☆86Apr 21, 2021Updated 4 years ago
- Front End process for the Perseo CEP☆16Mar 6, 2026Updated 3 weeks ago
- Launch AWS Elastic MapReduce jobs that process Common Crawl data.☆49Feb 15, 2017Updated 9 years ago
- Core Python Web Archiving Toolkit for replay and recording of web archives☆1,637Updated this week
- Sort-friendly URI Reordering Transform (SURT) python module☆45Sep 11, 2025Updated 6 months ago
- Common Crawl Index Server☆71Feb 28, 2025Updated last year
- Web archiving utility library☆11Mar 11, 2026Updated 2 weeks ago
- ☆25Mar 20, 2024Updated 2 years ago
- Common Crawl fork of Apache Nutch☆40Updated this week
- Load WARC files into Apache Spark with sparklyr☆12Jan 11, 2022Updated 4 years ago
- The pipeline for the OSCAR corpus☆176Nov 9, 2025Updated 4 months ago
- Utility to fetch provenance information from Internet Archive's Wayback Machine☆14Feb 5, 2026Updated last month
- news-please - an integrated web crawler and information extractor for news that just works☆2,401Sep 21, 2025Updated 6 months ago
- Easily convert common crawl to a dataset of caption and document. Image/text Audio/text Video/text, ...☆320Dec 9, 2023Updated 2 years ago
- An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed…☆158Oct 8, 2025Updated 5 months ago
- Simple multi-threaded tool to extract domain-related data from commoncrawl.org☆31Jul 17, 2018Updated 7 years ago
- Automating description for Web Archives in ArchivesSpace using the Archive-It CDX and Partner Data APIs☆11Aug 10, 2018Updated 7 years ago
- Code used for sourcing and cleaning the BigScience ROOTS corpus☆318Mar 20, 2023Updated 3 years ago
- Python Flask Kanban Board project☆22Jun 9, 2024Updated last year