Tools to construct and process Common Crawl webgraphs
☆110 · Jun 29, 2026 · Updated this week
Alternatives and similar repositories for cc-webgraph
Users interested in cc-webgraph are comparing it to the libraries listed below.
- Process Common Crawl data with Python and Spark ☆457 · Mar 26, 2026 · Updated 3 months ago
- A simple package for using WebGraph data in Python (via the Jython interpreter). ☆20 · Oct 21, 2020 · Updated 5 years ago
- Various Jupyter notebooks about Common Crawl data ☆66 · Updated this week
- Index Common Crawl archives in tabular format ☆131 · Jun 25, 2026 · Updated last week
- A list of awesome JSON-LD resources. ☆19 · Apr 21, 2026 · Updated 2 months ago
- Software for building the IR Anthology. ☆11 · Sep 19, 2023 · Updated 2 years ago
- Object Resource Stream and CDXJ Drafts ☆15 · Nov 28, 2018 · Updated 7 years ago
- News crawling with StormCrawler - stores content as WARC ☆376 · Updated this week
- Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends ☆57 · Jan 28, 2024 · Updated 2 years ago
- Datasette plugin providing a UI for executing SQL writes against the database ☆12 · Nov 11, 2025 · Updated 7 months ago
- The inverted index exchange format as defined as part of the Open-Source IR Replicability Challenge (OSIRRC) initiative ☆11 · Aug 6, 2025 · Updated 10 months ago
- Streaming WARC/ARC library for fast web archive IO ☆459 · Jun 10, 2026 · Updated 3 weeks ago
- MCP server tailored to connecting web crawler data and archives ☆43 · May 31, 2026 · Updated last month
- A command-line tool for using CommonCrawl Index API at http://index.commoncrawl.org/ ☆203 · Oct 7, 2018 · Updated 7 years ago
- Common Crawl fork of Apache Nutch ☆42 · Jun 25, 2026 · Updated last week
- Recommendation engine for scholarly articles ☆12 · Oct 22, 2019 · Updated 6 years ago
- Faceted Browsing over Wikidata triples ☆18 · Jun 16, 2018 · Updated 8 years ago
- A monolithic index that supports worst-case optimal joins (WCOJ) by providing all collation orders in a single redundancy eliminating dat… ☆18 · Sep 18, 2025 · Updated 9 months ago
- TAXI: a Taxonomy Induction Method based on Lexico-Syntactic Patterns, Substrings and Focused Crawling ☆29 · Jul 6, 2023 · Updated 2 years ago
- Common Index File Format to support interoperability between open-source IR engines ☆40 · Sep 19, 2024 · Updated last year
- Git repository synchronisation daemon ☆53 · Feb 21, 2017 · Updated 9 years ago
- Data and source code for the paper "How choosing random-walk model and network representation matters for flow-based community detection … ☆11 · Jan 7, 2021 · Updated 5 years ago
- The benjojo.co.uk fork of honk ☆15 · Jan 1, 2025 · Updated last year
- Make Metabase More Awesome ☆16 · Jul 24, 2024 · Updated last year
- Internet Archive's Sparkling Data Processing Library ☆17 · May 4, 2026 · Updated 2 months ago
- Documentation, backgrounders and tutorial material related to information design, engineering, semantics, ontologies, and vocabularies ☆19 · May 22, 2023 · Updated 3 years ago
- A Slack/Notion Integration to Help Document Slack Conversations! ☆23 · Mar 2, 2020 · Updated 6 years ago
- Source code of BARTOC.org user interface ☆29 · Jun 24, 2026 · Updated last week
- ☆16 · Feb 10, 2026 · Updated 4 months ago
- Data and code related to the report "Truth, Lies, and Automation: How Language Models Could Change Disinformation" ☆28 · May 18, 2021 · Updated 5 years ago
- BEACON link dump format specification ☆17 · Jan 4, 2018 · Updated 8 years ago
- ☆17 · Dec 11, 2024 · Updated last year
- Generate IPv4 12th order Hilbert heatmaps from a file of IPv4 addresses. ☆13 · Apr 11, 2024 · Updated 2 years ago
- Repository for public code and data associated with the paper "Fake News on Twitter During the 2016 U.S. Presidential Election" ☆12 · Dec 5, 2019 · Updated 6 years ago
- ☆14 · Dec 27, 2016 · Updated 9 years ago
- CKAN Extensions ☆12 · Aug 26, 2021 · Updated 4 years ago
- A library of examples showing how to use the Common Crawl corpus (2008-2012, ARC format) ☆66 · Aug 5, 2016 · Updated 9 years ago
- DC Tabular Application Profile - supporting materials ☆31 · Sep 28, 2023 · Updated 2 years ago
- Proceedings of the annual intercalary robot dance party in celebration of workshop on symposium about 2^6th birthdays; in particular, tha… ☆21 · May 10, 2026 · Updated last month