Tools to construct and process Common Crawl webgraphs
☆105Feb 19, 2026Updated last month
Alternatives and similar repositories for cc-webgraph
Users that are interested in cc-webgraph are comparing it to the libraries listed below.
- Statistics of Common Crawl monthly archives mined from URL index files☆212Updated this week
- Process Common Crawl data with Python and Spark☆453Jan 20, 2026Updated 2 months ago
- A simple package for using WebGraph data in Python (via the Jython interpreter).☆20Oct 21, 2020Updated 5 years ago
- Scientific articles using or citing Common Crawl data☆28Updated this week
- Various Jupyter notebooks about Common Crawl data☆64Nov 22, 2025Updated 4 months ago
- A list of awesome JSON-LD resources.☆17Aug 31, 2022Updated 3 years ago
- Index Common Crawl archives in tabular format☆126Updated this week
- Software for building the IR Anthology.☆11Sep 19, 2023Updated 2 years ago
- News crawling with StormCrawler - stores content as WARC☆364Feb 19, 2025Updated last year
- Tool for comparing two ranked lists (TREC run files)☆20Nov 9, 2022Updated 3 years ago
- Extracts plain text, language identification and more metadata from WARC records☆23Oct 1, 2025Updated 5 months ago
- Datasette plugin providing a UI for executing SQL writes against the database☆12Nov 11, 2025Updated 4 months ago
- The inverted index exchange format as defined as part of the Open-Source IR Replicability Challenge (OSIRRC) initiative☆11Aug 6, 2025Updated 7 months ago
- R library for common information retrieval metrics☆14Jun 5, 2023Updated 2 years ago
- Streaming WARC/ARC library for fast web archive IO☆451Dec 10, 2024Updated last year
- My effort to keep myself on nixos while having a comfy and crazy workstation☆10Jul 8, 2024Updated last year
- Implementation of W3C's R2RML and Direct Mapping specifications☆10Oct 12, 2020Updated 5 years ago
- Common Crawl fork of Apache Nutch☆40Mar 12, 2026Updated last week
- A simple Rust library to retrieve data from https://api.carbonintensity.org.uk/☆11Oct 15, 2024Updated last year
- Multi-Entity Extraction Framework for Academic Documents (with default extraction tools)☆31Oct 3, 2023Updated 2 years ago
- TAXI: a Taxonomy Induction Method based on Lexico-Syntactic Patterns, Substrings and Focused Crawling☆29Jul 6, 2023Updated 2 years ago
- Common Index File Format to support interoperability between open-source IR engines☆40Sep 19, 2024Updated last year
- Data and source code for the paper "How choosing random-walk model and network representation matters for flow-based community detection …☆11Jan 7, 2021Updated 5 years ago
- Material for git workshop☆11Mar 13, 2018Updated 8 years ago
- Quit Datasette if it has not received traffic for a specified time period☆17Feb 18, 2026Updated last month
- A high-throughput ontology-based pipeline for data integration☆14May 17, 2023Updated 2 years ago
- ☆14Dec 27, 2016Updated 9 years ago
- An evaluation script based on the C/W/L framework that is TREC Compatible and provides a replacement for TREC_EVAL and independent script…☆15May 1, 2023Updated 2 years ago
- WP-CLI package to "environmentalize" a WordPress installation☆11Oct 25, 2016Updated 9 years ago
- A library of examples showing how to use the Common Crawl corpus (2008-2012, ARC format)☆65Aug 5, 2016Updated 9 years ago
- ☆15Jul 29, 2020Updated 5 years ago
- MLIR backend for Nx☆14May 24, 2024Updated last year
- A whirlwind tour of Common Crawl's data using Python☆37Feb 17, 2026Updated last month
- Twitter Discovery: Search articles referenced in your tweets, retweets, and favorites☆16Jun 16, 2020Updated 5 years ago
- Expose Datasette instances to LLM as a tool☆27May 27, 2025Updated 9 months ago
- ☆15Feb 22, 2021Updated 5 years ago
- Toolkit for domain-specific information retrieval experimentation☆19Feb 24, 2026Updated last month
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine☆201Jan 23, 2026Updated 2 months ago
- ☆21Updated this week