Process Common Crawl data with Python and Spark
☆457Mar 26, 2026Updated 2 months ago
Alternatives and similar repositories for cc-pyspark
Users that are interested in cc-pyspark are comparing it to the libraries listed below.
- Index Common Crawl archives in tabular format☆129Updated this week
- Various Jupyter notebooks about Common Crawl data☆66Nov 22, 2025Updated 6 months ago
- Demonstration of using Python to process the Common Crawl dataset with the mrjob framework☆168Jan 27, 2026Updated 4 months ago
- News crawling with StormCrawler - stores content as WARC☆372Updated this week
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine☆209Jun 8, 2026Updated last week
- Tools to construct and process Common Crawl webgraphs☆108May 25, 2026Updated 3 weeks ago
- Streaming WARC/ARC library for fast web archive IO☆458Updated this week
- Statistics of Common Crawl monthly archives mined from URL index files☆222May 26, 2026Updated 3 weeks ago
- Useful tools to extract Malayalam text from the Common Crawl datasets☆28Apr 21, 2026Updated last month
- A command-line tool for using CommonCrawl Index API at http://index.commoncrawl.org/☆203Oct 7, 2018Updated 7 years ago
- Common web archive utility code.☆64Jun 3, 2026Updated last week
- Tools to download and cleanup Common Crawl data☆1,044Apr 25, 2023Updated 3 years ago
- A robust web archive analytics toolkit☆141Updated this week
- Scientific articles using or citing Common Crawl data☆29May 26, 2026Updated 3 weeks ago
- Inspired by Google C4, here is a series of colossal clean data cleaning scripts focused on CommonCrawl data processing. Including Chinese…☆136Jun 7, 2023Updated 3 years ago
- Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends☆57Jan 28, 2024Updated 2 years ago
- Index URLs in Common Crawl☆197Sep 19, 2017Updated 8 years ago
- Python 3 library for reading and writing warc files☆21Jan 29, 2018Updated 8 years ago
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop☆38Mar 12, 2026Updated 3 months ago
- DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate…☆53Jun 12, 2020Updated 6 years ago
- An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline.☆86Apr 21, 2021Updated 5 years ago
- Data Engineering pipeline hosted entirely in the AWS ecosystem utilizing DocumentDB as the database☆14Oct 26, 2021Updated 4 years ago
- Core Python Web Archiving Toolkit for replay and recording of web archives☆1,667Apr 10, 2026Updated 2 months ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.☆46Dec 4, 2017Updated 8 years ago
- Web archiving utility library☆11May 5, 2026Updated last month
- ☆26Mar 20, 2024Updated 2 years ago
- gzipstream allows Python to process multi-part gzip files from a streaming source☆23Feb 24, 2017Updated 9 years ago
- Search the common crawl using lambda functions☆96Mar 15, 2019Updated 7 years ago
- The pipeline for the OSCAR corpus☆177Nov 9, 2025Updated 7 months ago
- utility to fetch provenance information from Internet Archive's Wayback Machine☆15Feb 5, 2026Updated 4 months ago
- ☆15Aug 15, 2012Updated 13 years ago
- Please note that the warc-indexer tool & code is now supported by NetArchiveSuite. The 'warc-indexer' directory and code that exists in t…☆132Nov 21, 2025Updated 6 months ago
- news-please - an integrated web crawler and information extractor for news that just works☆2,458Apr 14, 2026Updated 2 months ago
- Easily convert common crawl to a dataset of caption and document. Image/text Audio/text Video/text, ...☆321Dec 9, 2023Updated 2 years ago
- An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed…☆161Oct 8, 2025Updated 8 months ago
- 🕸 A simple way to extract data from Common Crawl☆35Feb 24, 2020Updated 6 years ago
- Simple multi-threaded tool to extract domain related data from commoncrawl.org☆31Jul 17, 2018Updated 7 years ago
- Automating description for Web Archives in ArchivesSpace using the Archive-It CDX and Partner Data APIs☆11Aug 10, 2018Updated 7 years ago
- Code used for sourcing and cleaning the BigScience ROOTS corpus☆318Mar 20, 2023Updated 3 years ago