Demonstration of using Python to process the Common Crawl dataset with the mrjob framework
☆168Jan 27, 2026Updated 4 months ago
Alternatives and similar repositories for cc-mrjob
Users that are interested in cc-mrjob are comparing it to the libraries listed below.
- Launch AWS Elastic MapReduce jobs that process Common Crawl data.☆49Feb 15, 2017Updated 9 years ago
- DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate…☆53Jun 12, 2020Updated 6 years ago
- Index URLs in Common Crawl☆197Sep 19, 2017Updated 8 years ago
- A command-line tool for using the CommonCrawl Index API at http://index.commoncrawl.org/☆203Oct 7, 2018Updated 7 years ago
- gzipstream allows Python to process multi-part gzip files from a streaming source☆23Feb 24, 2017Updated 9 years ago
- Python library for reading and writing warc files☆249Mar 7, 2022Updated 4 years ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.☆46Dec 4, 2017Updated 8 years ago
- Hadoop jobs for WikiReverse project. Parses Common Crawl data for links to Wikipedia articles.☆37Aug 12, 2018Updated 7 years ago
- Backend of Common Search. Analyses webpages and sends them to the index.☆122May 31, 2017Updated 9 years ago
- Common web archive utility code.☆64Jun 3, 2026Updated last week
- News crawling with StormCrawler - stores content as WARC☆372Updated this week
- Events and Situations Ontology☆14Apr 20, 2018Updated 8 years ago
- Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.☆18Aug 28, 2023Updated 2 years ago
- Topic Modeling Workflow in Python☆16Feb 18, 2023Updated 3 years ago
- Python 3 library for reading and writing warc files☆21Jan 29, 2018Updated 8 years ago
- Streaming WARC/ARC library for fast web archive IO☆458Updated this week
- Extracting Entities with Limited Evidence☆16Dec 26, 2022Updated 3 years ago
- Linking Entities in CommonCrawl Dataset onto Wikipedia Concepts☆59Sep 5, 2012Updated 13 years ago
- Statistics of Common Crawl monthly archives mined from URL index files☆222May 26, 2026Updated 3 weeks ago
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine☆209Jun 8, 2026Updated last week
- Programmatically validate pull requests against a project's contribution guidelines☆11Nov 2, 2015Updated 10 years ago
- Significant rewrite of the LodLive tool for RDF visualization and SPARQL generation☆17Apr 13, 2017Updated 9 years ago
- Index Common Crawl archives in tabular format☆129Updated this week
- Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends☆57Jan 28, 2024Updated 2 years ago
- An experimental implementation of Burrows' Delta in Python 3☆12Jun 6, 2017Updated 9 years ago
- Webrecorders DevTools Protocol Automation Library☆18Oct 18, 2022Updated 3 years ago
- Scripts for Wikidata☆21Jan 10, 2026Updated 5 months ago
- Python Flask-RESTful template for cookiecutter☆11Mar 31, 2016Updated 10 years ago
- A CLI tool to open Javadoc packaged as a JAR in the browser☆15Jan 30, 2024Updated 2 years ago
- Apache Nutch fork tuned for web services and data discovery.☆10May 18, 2015Updated 11 years ago
- Showcasing the training of various NLP downstream tasks with pre-trained language models using PyTorch Lightning☆13Aug 7, 2022Updated 3 years ago
- ☆25Jan 22, 2024Updated 2 years ago
- Reimplementation of Munkhdalai et al's Neural Semantic Encoders (https://arxiv.org/pdf/1607.04315v2.pdf)☆59Oct 28, 2016Updated 9 years ago
- Distributed Twitter crawler in Python☆25Nov 4, 2022Updated 3 years ago
- Extract synonyms and keywords from sentences using a modified implementation of the Aho-Corasick algorithm☆40Aug 17, 2017Updated 8 years ago
- gzipstream allows Python to process multi-part gzip files from a streaming source☆17Jun 10, 2016Updated 10 years ago
- Set of scripts to aid in the download of the GDELT data files from www.gdeltproject.org☆12May 17, 2014Updated 12 years ago
- ☆22Feb 17, 2020Updated 6 years ago
- A NodeJS and/or Flask backend written in Sudolang☆19Apr 30, 2023Updated 3 years ago