Tools to construct and process Common Crawl webgraphs
☆108May 25, 2026Updated 2 weeks ago
Alternatives and similar repositories for cc-webgraph
Users that are interested in cc-webgraph are comparing it to the libraries listed below.
- WebGraph is a framework for graph compression.☆17Jun 22, 2025Updated 11 months ago
- Process Common Crawl data with Python and Spark☆455Mar 26, 2026Updated 2 months ago
- A simple package for using WebGraph data in Python (via the Jython interpreter).☆20Oct 21, 2020Updated 5 years ago
- Various Jupyter notebooks about Common Crawl data☆66Nov 22, 2025Updated 6 months ago
- Index Common Crawl archives in tabular format☆128Jun 4, 2026Updated last week
- Software for building the IR Anthology.☆11Sep 19, 2023Updated 2 years ago
- Object Resource Stream and CDXJ Drafts☆15Nov 28, 2018Updated 7 years ago
- News crawling with StormCrawler - stores content as WARC☆372Updated this week
- Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends☆57Jan 28, 2024Updated 2 years ago
- The inverted index exchange format as defined as part of the Open-Source IR Replicability Challenge (OSIRRC) initiative☆11Aug 6, 2025Updated 10 months ago
- R library for common information retrieval metrics☆14Jun 5, 2023Updated 3 years ago
- Streaming WARC/ARC library for fast web archive IO☆458Updated this week
- Implementation of W3C's R2RML and Direct Mapping specifications☆10Oct 12, 2020Updated 5 years ago
- MCP server tailored to connecting web crawler data and archives☆42May 31, 2026Updated 2 weeks ago
- A command-line tool for using CommonCrawl Index API at http://index.commoncrawl.org/☆203Oct 7, 2018Updated 7 years ago
- Common Crawl fork of Apache Nutch☆41Jun 3, 2026Updated last week
- Multi-Entity Extraction Framework for Academic Documents (with default extraction tools)☆31Oct 3, 2023Updated 2 years ago
- Recommendation engine for scholarly articles☆12Oct 22, 2019Updated 6 years ago
- A monolithic index that supports worst-case optimal joins (WCOJ) by providing all collation orders in a single redundancy eliminating dat…☆18Sep 18, 2025Updated 8 months ago
- TAXI: a Taxonomy Induction Method based on Lexico-Syntactic Patterns, Substrings and Focused Crawling☆29Jul 6, 2023Updated 2 years ago
- Common Index File Format to support interoperability between open-source IR engines☆40Sep 19, 2024Updated last year
- A Python Wrapper To Retrieve Data From The CrowdTangle API☆11Mar 26, 2026Updated 2 months ago
- Generative Reranker PyTerrier☆18Dec 1, 2025Updated 6 months ago
- Make Metabase More Awesome☆16Jul 24, 2024Updated last year
- Internet Archive's Sparkling Data Processing Library☆17May 4, 2026Updated last month
- MOVED to https://gitlab.com/crossref/reference_matching_evaluation_framework☆17Jul 1, 2019Updated 6 years ago
- Documentation, backgrounders and tutorial material related to information design, engineering, semantics, ontologies, and vocabularies☆18May 22, 2023Updated 3 years ago
- Source code of BARTOC.org user interface☆28Updated this week
- An OpenCalais API Interface for Python.☆21Mar 13, 2012Updated 14 years ago
- A high-throughput ontology-based pipeline for data integration☆16May 17, 2023Updated 3 years ago
- Data and code related to the report "Truth, Lies, and Automation: How Language Models Could Change Disinformation"☆28May 18, 2021Updated 5 years ago
- Create an Anime database containing all the Anime currently available on the website, which includes: 'Anime Title', 'Description', 'C…☆11Jun 10, 2020Updated 6 years ago
- ☆17Dec 11, 2024Updated last year
- Generate IPv4 12th order Hilbert heatmaps from a file of IPv4 addresses.☆13Apr 11, 2024Updated 2 years ago
- Repository for public code and data associated with the paper "Fake News on Twitter During the 2016 U.S. Presidential Election"☆12Dec 5, 2019Updated 6 years ago
- CKAN Extensions☆12Aug 26, 2021Updated 4 years ago
- A distributed system for mining common crawl using SQS, AWS-EC2 and S3☆22Jun 24, 2014Updated 11 years ago
- An evaluation script based on the C/W/L framework that is TREC Compatible and provides a replacement for TREC_EVAL and independent script…☆15May 1, 2023Updated 3 years ago
- WP-CLI package to "environmentalize" a WordPress installation☆11Oct 25, 2016Updated 9 years ago