Demonstration of using Python to process the Common Crawl dataset with the mrjob framework
☆168 · Jan 27, 2026 · Updated last month
Alternatives and similar repositories for cc-mrjob
Users who are interested in cc-mrjob compare it to the libraries listed below.
- Process Common Crawl data with Python and Spark ☆452 · Jan 20, 2026 · Updated last month
- gzipstream allows Python to process multi-part gzip files from a streaming source ☆23 · Feb 24, 2017 · Updated 9 years ago
- DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate… ☆52 · Jun 12, 2020 · Updated 5 years ago
- Python library for reading and writing warc files ☆248 · Mar 7, 2022 · Updated 4 years ago
- Extracting Entities with Limited Evidence ☆16 · Dec 26, 2022 · Updated 3 years ago
- Significant rewrite of the LodLive tool for RDF visualization and SPARQL generation ☆17 · Apr 13, 2017 · Updated 8 years ago
- News crawling with StormCrawler - stores content as WARC ☆364 · Feb 19, 2025 · Updated last year
- Linking Entities in CommonCrawl Dataset onto Wikipedia Concepts ☆59 · Sep 5, 2012 · Updated 13 years ago
- Dmoz RDF parser ☆28 · Jun 22, 2016 · Updated 9 years ago
- Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc. ☆19 · Aug 28, 2023 · Updated 2 years ago
- python 3 versions of code for the book Make Your Own Mandelbrot ☆11 · Dec 28, 2023 · Updated 2 years ago
- Index Common Crawl archives in tabular format ☆125 · Feb 19, 2026 · Updated 2 weeks ago
- Hadoop jobs for WikiReverse project. Parses Common Crawl data for links to Wikipedia articles. ☆38 · Aug 12, 2018 · Updated 7 years ago
- A-Frame GLTF Exporter component ☆13 · Apr 13, 2018 · Updated 7 years ago
- Python Flask-RESTful template for cookiecutter ☆11 · Mar 31, 2016 · Updated 9 years ago
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine ☆200 · Jan 23, 2026 · Updated last month
- Streaming WARC/ARC library for fast web archive IO ☆451 · Dec 10, 2024 · Updated last year
- Tool to dump all GPS traces collected by/for the OpenStreetMap project. ☆25 · Mar 6, 2019 · Updated 7 years ago
- Information Extraction System can perform NLP tasks like Named Entity Recognition, Sentence Simplification, Relation Extraction etc. ☆27 · Apr 23, 2014 · Updated 11 years ago
- Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends ☆57 · Jan 28, 2024 · Updated 2 years ago
- An academic open source and open data web crawler ☆27 · Nov 20, 2017 · Updated 8 years ago
- Distributed twitter crawler in Python ☆25 · Nov 4, 2022 · Updated 3 years ago
- Apache Nutch fork tuned for web services and data discovery. ☆10 · May 18, 2015 · Updated 10 years ago
- ☆22 · Feb 29, 2024 · Updated 2 years ago
- Collects multimedia content shared through social networks. ☆19 · Feb 18, 2015 · Updated 11 years ago
- The Snorocket Description Logic classifier for EL++ with concrete domains support ☆24 · Oct 7, 2021 · Updated 4 years ago
- Simple time tracking app built with MEAN stack: Angular.js, Node.js, Express, MongoDB ☆14 · Aug 2, 2016 · Updated 9 years ago
- Analyze standard numbers like ARK, DOI, EAN, GTIN, IBAN, ISAN, ISBN, ISMN, ISNI, ISSN, ISTC, ISWC, ORCID, PPN, SICI, UPC, ZDB with Elasti… ☆24 · Jul 5, 2016 · Updated 9 years ago
- gzipstream allows Python to process multi-part gzip files from a streaming source ☆17 · Jun 10, 2016 · Updated 9 years ago
- Classifier for predicting user interests based on Twitter profile and using Python library scikit-learn. ☆31 · Jun 7, 2013 · Updated 12 years ago
- Core Python Web Archiving Toolkit for replay and recording of web archives ☆1,629 · Jan 21, 2026 · Updated last month
- evaluation suite for testing automatic grammatical error corrections ☆39 · Jun 12, 2017 · Updated 8 years ago
- get direct answers in google using LLMs ☆18 · Apr 12, 2023 · Updated 2 years ago
- Scripts for Wikidata ☆21 · Jan 10, 2026 · Updated last month
- Retrofit Word Vectors to a Sense Ontology to Derive Word Sense Vectors ☆17 · Mar 5, 2015 · Updated 11 years ago
- Flask Cognises: AWS Cognito group based authorization with user management ☆15 · Dec 8, 2022 · Updated 3 years ago
- Turning news into events since 2014. ☆51 · May 1, 2017 · Updated 8 years ago
- kaggle allen ai competition ☆17 · Feb 23, 2016 · Updated 10 years ago
- spaCy-to-naf converter ☆21 · Jun 10, 2025 · Updated 8 months ago