commoncrawl/cc-mrjob

Demonstration of using Python to process the Common Crawl dataset with the mrjob framework

☆168

Alternatives and similar repositories for cc-mrjob

Users interested in cc-mrjob are comparing it to the repositories listed below.

commoncrawl / cc-pyspark
Process Common Crawl data with Python and Spark
☆457 · Mar 26, 2026 · Updated 4 months ago
dkpro / dkpro-c4corpus
DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate…
☆53 · Jun 12, 2020 · Updated 6 years ago
ikreymer / cdx-index-client
A command-line tool for using CommonCrawl Index API at http://index.commoncrawl.org/
☆203 · Oct 7, 2018 · Updated 7 years ago
internetarchive / warc
Python library for reading and writing warc files
☆249 · Mar 7, 2022 · Updated 4 years ago
ikreymer / webarchive-indexing
Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.
☆46 · Dec 4, 2017 · Updated 8 years ago
commoncrawl / cc-warc-examples
CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop
☆38 · Jun 30, 2026 · Updated 3 weeks ago
commoncrawl / gzipstream
gzipstream allows Python to process multi-part gzip files from a streaming source
☆23 · Feb 24, 2017 · Updated 9 years ago
rossf7 / wikireverse
Hadoop jobs for WikiReverse project. Parses Common Crawl data for links to Wikipedia articles.
☆38 · Aug 12, 2018 · Updated 7 years ago
commoncrawl / news-crawl
News crawling with StormCrawler - stores content as WARC
☆375 · Updated this week
newsreader / eso-and-ceo
Events and Situations Ontology
☆15 · Apr 20, 2018 · Updated 8 years ago
cligs / tmw
Topic Modeling Workflow in Python
☆16 · Feb 18, 2023 · Updated 3 years ago
webrecorder / warcio
Streaming WARC/ARC library for fast web archive IO
☆462 · Jun 10, 2026 · Updated last month
commoncrawl / cc-crawl-statistics
Statistics of Common Crawl monthly archives mined from URL index files
☆227 · Updated this week
commoncrawl / cdx_toolkit
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
☆208 · Jun 24, 2026 · Updated last month
CI-Research / KeywordAnalysis
Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends
☆57 · Jan 28, 2024 · Updated 2 years ago
iipc / webarchive-commons
Common web archive utility code.
☆65 · Jul 3, 2026 · Updated 3 weeks ago
centic9 / CommonCrawlDocumentDownload
A small tool which uses the CommonCrawl URL Index to download documents with certain file types or mime-types. This is used for mass-test…
☆74 · Jul 11, 2026 · Updated 2 weeks ago
emijrp / wikidata
Scripts for Wikidata
☆21 · Jul 3, 2026 · Updated 3 weeks ago
socialsensor / storm-focused-crawler
Collects multimedia content shared through social networks.
☆19 · Feb 18, 2015 · Updated 11 years ago
N0taN3rd / simplechrome
Webrecorder's DevTools Protocol Automation Library
☆18 · Oct 18, 2022 · Updated 3 years ago
xiaoganghan / wikientities
Linking Entities in CommonCrawl Dataset onto Wikipedia Concepts
☆59 · Sep 5, 2012 · Updated 13 years ago
webrecorder / pywb
Core Python Web Archiving Toolkit for replay and recording of web archives
☆1,684 · Apr 10, 2026 · Updated 3 months ago
norakassner / mlama
☆25 · Jan 22, 2024 · Updated 2 years ago
pdasigi / neural-semantic-encoders
Reimplementation of Munkhdalai et al's Neural Semantic Encoders (https://arxiv.org/pdf/1607.04315v2.pdf)
☆59 · Oct 28, 2016 · Updated 9 years ago
00krishna-tools / gdelt_download
Set of scripts to aid in the download of the GDELT data files from www.gdeltproject.org
☆12 · May 17, 2014 · Updated 12 years ago
scrapinghub / aile
Automatic Item List Extraction
☆85 · Jun 15, 2016 · Updated 10 years ago
sanjaymeena / InformationExtractionSystem
Information Extraction System can perform NLP tasks like Named Entity Recognition, Sentence Simplification, Relation Extraction etc.
☆27 · Apr 23, 2014 · Updated 12 years ago
aws-samples / amazon-s3-security-settings-and-controls
☆22 · Feb 17, 2020 · Updated 6 years ago
aleemrehmtulla / gpt3-google-extension
Get direct answers in Google using LLMs
☆18 · Apr 12, 2023 · Updated 3 years ago
Smerity / gzipstream
gzipstream allows Python to process multi-part gzip files from a streaming source
☆17 · Jun 10, 2016 · Updated 10 years ago
vi3k6i5 / synonym-extractor
Extract synonyms and keywords from sentences using a modified implementation of the Aho-Corasick algorithm
☆40 · Aug 17, 2017 · Updated 8 years ago
OpenText-org / original_annotation
XML files for linguistic annotation of the Greek New Testament
☆13 · Jun 12, 2018 · Updated 8 years ago
mollerhoj / Scandinavian-ULMFiT
The weights for the embedding layer of Scandinavian ULMFiT language models
☆32 · Dec 5, 2019 · Updated 6 years ago
bwilbertz / kaggle_allen_ai
Kaggle Allen AI competition
☆17 · Feb 23, 2016 · Updated 10 years ago
rycolab / entropyRegularization
Code for Generalized Entropy Regularization paper
☆14 · May 2, 2020 · Updated 6 years ago
igorlukanin / coursera-hse-machine-learning
My source code and solutions for the machine learning course on Coursera
☆12 · Mar 17, 2016 · Updated 10 years ago
openeventdata / phoenix_pipeline
Turning news into events since 2014.
☆52 · May 1, 2017 · Updated 9 years ago
matejsuchanek / pywikibot-scripts
Own pywikibot scripts (for Wikimedia projects)
☆21 · Jul 11, 2026 · Updated 2 weeks ago
cnap / grammaticality-metrics
Evaluation suite for testing automatic grammatical error corrections
☆40 · Jun 12, 2017 · Updated 9 years ago