commoncrawl/cc-webgraph

Tools to construct and process Common Crawl webgraphs

☆111

Alternatives and similar repositories for cc-webgraph

Users who are interested in cc-webgraph are comparing it to the repositories listed below.

commoncrawl / cc-pyspark
Process Common Crawl data with Python and Spark
☆457 · Mar 26, 2026 · Updated 3 months ago
commoncrawl / cc-citations
Scientific articles using or citing Common Crawl data
☆29 · Jul 8, 2026 · Updated 2 weeks ago
commoncrawl / cc-notebooks
Various Jupyter notebooks about Common Crawl data
☆66 · Jul 3, 2026 · Updated 3 weeks ago
commoncrawl / cc-index-table
Index Common Crawl archives in tabular format
☆132 · Updated this week
json-ld / awesome-json-ld
A list of awesome JSON-LD resources.
☆19 · Apr 21, 2026 · Updated 3 months ago
AnswerDotAI / nbdev-index
nbdev docs lookup for a few libraries and Python itself
☆25 · Feb 9, 2026 · Updated 5 months ago
commoncrawl / news-crawl
News crawling with StormCrawler - stores content as WARC
☆375 · Updated this week
capreolus-ir / diffir
Tool for comparing two ranked lists (TREC run files)
☆20 · Nov 9, 2022 · Updated 3 years ago
commoncrawl / cc-host-index
Tools for working with the host index
☆15 · Jun 1, 2026 · Updated last month
bitextor / warc2text
Extracts plain text, language identification and more metadata from WARC records
☆23 · Apr 16, 2026 · Updated 3 months ago
microsoft / irmetrics-r
R library for common information retrieval metrics
☆14 · Jun 5, 2023 · Updated 3 years ago
ikreymer / cdx-index-client
A command-line tool for using CommonCrawl Index API at http://index.commoncrawl.org/
☆203 · Oct 7, 2018 · Updated 7 years ago
arcolife / scholarec
Recommendation engine for scholarly articles
☆12 · Oct 22, 2019 · Updated 6 years ago
Explore-AI / cloud-computing-predict
☆10 · Jun 29, 2026 · Updated 3 weeks ago
dice-group / hypertrie
A monolithic index that supports worst-case optimal joins (WCOJ) by providing all collation orders in a single redundancy eliminating dat…
☆18 · Updated this week
osirrc / ciff
Common Index File Format to support interoperability between open-source IR engines
☆40 · Sep 19, 2024 · Updated last year
thunderpoot / scdx
A simple tool for querying the Common Crawl CDX
☆16 · Jan 10, 2026 · Updated 6 months ago
mapequation / mapping-hypergraphs
Data and source code for the paper "How choosing random-walk model and network representation matters for flow-based community detection …
☆12 · Jan 7, 2021 · Updated 5 years ago
vordimous / gohlay
The Kafka message scheduling tool.
☆19 · Jan 20, 2025 · Updated last year
biocodellc / ontology-data-pipeline
A high-throughput ontology-based pipeline for data integration
☆16 · May 17, 2023 · Updated 3 years ago
CrossRef / reference-matching-evaluation
MOVED to https://gitlab.com/crossref/reference_matching_evaluation_framework
☆17 · Jul 1, 2019 · Updated 7 years ago
ckan / ckantoolkit
Backports for ckan.plugins.toolkit to ease CKAN extension compatibility
☆17 · Apr 6, 2022 · Updated 4 years ago
filipecasal / knowledge-repo
☆15 · Feb 22, 2021 · Updated 5 years ago
georgetown-cset / GPT3-Disinformation
Data and code related to the report "Truth, Lies, and Automation: How Language Models Could Change Disinformation"
☆28 · May 18, 2021 · Updated 5 years ago
10Kang / DE_Zoomcamp2024_ZY
Repository for Data Engineering Zoomcamp 2024
☆14 · Mar 25, 2024 · Updated 2 years ago
gfjreg / CommonCrawl
A distributed system for mining Common Crawl using SQS, AWS EC2 and S3
☆22 · Jun 24, 2014 · Updated 12 years ago
gbv / beaconspec
BEACON link dump format specification
☆17 · Jan 4, 2018 · Updated 8 years ago
LazerLab / twitter-fake-news-replication
Repository for public code and data associated with the paper "Fake News on Twitter During the 2016 U.S. Presidential Election"
☆12 · Dec 5, 2019 · Updated 6 years ago
therealshabi / Wikidata-bot
A chatbot for querying Wikidata and DBpedia
☆15 · May 11, 2018 · Updated 8 years ago
michaloo / wp-cli-environmentalize
WP-CLI package to "environmentalize" a WordPress installation
☆12 · Oct 25, 2016 · Updated 9 years ago
dcmi / dcap
DC Tabular Application Profile - supporting materials
☆32 · Sep 28, 2023 · Updated 2 years ago
joseignm / GraFa
Faceted Browsing over Wikidata triples
☆18 · Jun 16, 2018 · Updated 8 years ago
ryanbrownnetworking777 / dataengineerio-capstone-ryanbrown
Capstone project for the Dataengineer.io bootcamp (public repo)
☆12 · Feb 20, 2024 · Updated 2 years ago
ncbo / bioportal_web_ui
A Rails application for biological ontologies
☆24 · Updated this week
koaning / kadro
A friendly pandas wrapper with support for a more composable grammar.
☆13 · Mar 7, 2017 · Updated 9 years ago
oasis-tcs / ubl
OASIS UBL TC: A public GitHub repository for the committee-member collaborative activity in developing the raw materials for inclusion in…
☆15 · Apr 13, 2026 · Updated 3 months ago
hscells / pybool_ir
Toolkit for domain-specific information retrieval experimentation
☆19 · May 18, 2026 · Updated 2 months ago
dkpro / dkpro-c4corpus
DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate…
☆53 · Jun 12, 2020 · Updated 6 years ago
ckan / extensions.ckan.org
CKAN Extensions
☆12 · Aug 26, 2021 · Updated 4 years ago