Tools to construct and process Common Crawl webgraphs
☆105Feb 19, 2026Updated last week
Alternatives and similar repositories for cc-webgraph
Users who are interested in cc-webgraph are comparing it to the libraries listed below
- Statistics of Common Crawl monthly archives mined from URL index files☆210Updated this week
- Process Common Crawl data with Python and Spark☆452Jan 20, 2026Updated last month
- Organizing and publishing the web domains of the US federal government☆17Sep 2, 2018Updated 7 years ago
- Various Jupyter notebooks about Common Crawl data☆64Nov 22, 2025Updated 3 months ago
- A simple package allowing to use WebGraph data in Python (via the Jython interpreter).☆20Oct 21, 2020Updated 5 years ago
- Scientific articles using or citing Common Crawl data☆28Jan 9, 2026Updated last month
- A list of awesome JSON-LD resources.☆17Aug 31, 2022Updated 3 years ago
- Tool for comparing two ranked lists (TREC run files)☆20Nov 9, 2022Updated 3 years ago
- News crawling with StormCrawler - stores content as WARC☆364Feb 19, 2025Updated last year
- Index Common Crawl archives in tabular format☆125Feb 19, 2026Updated last week
- R library for common information retrieval metrics☆14Jun 5, 2023Updated 2 years ago
- Datasette plugin providing a UI for executing SQL writes against the database☆12Nov 11, 2025Updated 3 months ago
- Software for building the IR Anthology.☆11Sep 19, 2023Updated 2 years ago
- Make Metabase More Awesome☆16Jul 24, 2024Updated last year
- Simulated user for TREC 2016-2017 Dynamic Domain track☆10Dec 27, 2017Updated 8 years ago
- A monolithic index that supports worst-case optimal joins (WCOJ) by providing all collation orders in a single redundancy eliminating dat…☆16Sep 18, 2025Updated 5 months ago
- Object Resource Stream and CDXJ Drafts☆14Nov 28, 2018Updated 7 years ago
- The inverted index exchange format as defined as part of the Open-Source IR Replicability Challenge (OSIRRC) initiative☆11Aug 6, 2025Updated 6 months ago
- Generative Reranker PyTerrier☆18Dec 1, 2025Updated 3 months ago
- A high-throughput ontology-based pipeline for data integration☆14May 17, 2023Updated 2 years ago
- The benjojo.co.uk fork of honk☆15Jan 1, 2025Updated last year
- Recommendation engine for scholarly articles☆12Oct 22, 2019Updated 6 years ago
- Expose Datasette instances to LLM as a tool☆26May 27, 2025Updated 9 months ago
- ☆12Jun 27, 2024Updated last year
- DimmWitted Gibbs Sampler in C++ — ⚠️🚧🛑 REPO MOVED TO DEEPDIVE 👉🏿☆17Jan 23, 2017Updated 9 years ago
- Data and source code for the paper "How choosing random-walk model and network representation matters for flow-based community detection …☆11Jan 7, 2021Updated 5 years ago
- A command-line tool for using CommonCrawl Index API at http://index.commoncrawl.org/☆206Oct 7, 2018Updated 7 years ago
- TAXI: a Taxonomy Induction Method based on Lexico-Syntactic Patterns, Substrings and Focused Crawling☆29Jul 6, 2023Updated 2 years ago
- An evaluation script based on the C/W/L framework that is TREC Compatible and provides a replacement for TREC_EVAL and independent script…☆15May 1, 2023Updated 2 years ago
- Quit Datasette if it has not received traffic for a specified time period☆17Feb 18, 2026Updated last week
- ☆16Dec 11, 2024Updated last year
- A library of examples showing how to use the Common Crawl corpus (2008-2012, ARC format)☆65Aug 5, 2016Updated 9 years ago
- Common Index File Format to support interoperability between open-source IR engines☆40Sep 19, 2024Updated last year
- Vector Space Model Framework developed for InPhO☆39May 9, 2025Updated 9 months ago
- Linked Data Competency Index☆19Mar 11, 2021Updated 4 years ago
- A friendly pandas wrapper with a more composable grammar support.☆14Mar 7, 2017Updated 8 years ago
- Streaming WARC/ARC library for fast web archive IO☆451Dec 10, 2024Updated last year
- Repository for my master thesis on automated string handling☆17Jul 17, 2021Updated 4 years ago
- Applying Reinforcement Learning from Human Feedback to language models to teach them to write short story responses to writing prompts.☆14May 5, 2022Updated 3 years ago