commoncrawl/cc-notebooks

commoncrawl / cc-notebooks

Various Jupyter notebooks about Common Crawl data

☆66

Alternatives and similar repositories for cc-notebooks

Users who are interested in cc-notebooks are comparing it to the repositories listed below.

commoncrawl / cc-pyspark
Process Common Crawl data with Python and Spark
☆457 · Mar 26, 2026 · Updated 3 months ago
commoncrawl / cc-crawl-statistics
Statistics of Common Crawl monthly archives mined from URL index files
☆226 · Updated this week
commoncrawl / cc-webgraph
Tools to construct and process Common Crawl webgraphs
☆111 · Updated this week
commoncrawl / news-crawl
News crawling with StormCrawler - stores content as WARC
☆375 · Updated this week
ikreymer / cc-index-server
Deployment of pywb as a CommonCrawl Index Server
☆22 · Oct 6, 2017 · Updated 8 years ago
minimaxir / resetera-gpt-2
Scraper of ResetEra threads and posts to get them into a format suitable for feeding them into GPT-2.
☆15 · Jun 20, 2019 · Updated 7 years ago
josephrocca / lit-encoder-js
LiT (Zero-Shot Transfer with Locked-image text Tuning) image and text encoder models, working in the browser
☆11 · May 16, 2022 · Updated 4 years ago
ROBLOX12 / ROBLOX
☆10 · Apr 10, 2014 · Updated 12 years ago
tml-epfl / icl-alignment
Is In-Context Learning Sufficient for Instruction Following in LLMs? [ICLR 2025]
☆33 · Jan 23, 2025 · Updated last year
codingforentrepreneurs / Smarter-Web-Scraping-with-Python
Leverage modern open-source tools to create better web scraping workflows.
☆31 · Feb 29, 2024 · Updated 2 years ago
zcaceres / builtwith-api
TypeScript library, MCP, and agent-friendly CLI for the BuiltWith API.
☆23 · Jul 8, 2026 · Updated 2 weeks ago
tballison / file-observatory
Single server/laptop grade file-observatory
☆10 · Mar 30, 2023 · Updated 3 years ago
mattn / entgo-bbs
☆10 · May 17, 2022 · Updated 4 years ago
Data4Democracy / assemble
NOT AN ACTIVE PROJECT -- Check readme for data sources
☆36 · May 28, 2017 · Updated 9 years ago
tsafavi / cascader
CascadER: Cross-Modal Cascading for Knowledge Graph Link Prediction (arXiv 22)
☆13 · Jun 17, 2022 · Updated 4 years ago
ikreymer / webarchive-indexing
Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.
☆46 · Dec 4, 2017 · Updated 8 years ago
datahoarder / ca-warn
A collection and conversion of WARN notices from California
☆12 · May 13, 2016 · Updated 10 years ago
kambrium / staticmapservice
A web service that generates static maps
☆18 · Dec 30, 2025 · Updated 6 months ago
terrierteam / pyterrier_t5
☆17 · Apr 30, 2026 · Updated 2 months ago
milosgajdos / embeviz
A simple app for visualising text embeddings
☆24 · Jul 20, 2025 · Updated last year
chatnoir-eu / chatnoir-resiliparse
A robust web archive analytics toolkit
☆144 · Updated this week
phaniteja1 / react-csv-viewer
React Component for Uploading and Viewing your CSV File as a table
☆15 · Feb 18, 2023 · Updated 3 years ago
jihyeseo / docker-noin
Dokgeo-noin: (Korean-language) information for (Korean) workers working in Germany
☆11 · Jun 7, 2020 · Updated 6 years ago
cloudmercato / os-benchmark
Handy tool for Object Storage performance benchmark
☆12 · Sep 8, 2025 · Updated 10 months ago
kaz-Anova / Competitive_Dai
The code to generate a top 20 score in the amazon classification challenge using Driverless AI's predictions and feature engineering : In…
☆19 · Dec 2, 2017 · Updated 8 years ago
w-e-ll / Google-Maps-Hotel-Async-Python-Scraper
Hotel data scraper from Google Maps service
☆11 · Jul 6, 2022 · Updated 4 years ago
matheusportela / web-crawler
Didactic Web crawler for Web Search Engines (CS 6913) course at NYU
☆10 · Dec 8, 2022 · Updated 3 years ago
JulianEberius / dwtc-extractor
Extraction code used to create the Dresden Web Table Corpus
☆14 · Feb 25, 2015 · Updated 11 years ago
EventStudyTools / api-wrapper.r
☆10 · Jul 7, 2026 · Updated 2 weeks ago
astutic / brat-standoff-to-json
Converts brat standoff format to JSONL format
☆13 · Jan 29, 2022 · Updated 4 years ago
SIDN / pathvis
PathVis visualises traceroutes
☆11 · Jan 25, 2024 · Updated 2 years ago
DrMeepso / Dall-e-Mini-Bot
☆12 · Sep 9, 2022 · Updated 3 years ago
ceshine / examples
A set of examples around pytorch in Vision, Text, Reinforcement Learning, etc.
☆12 · Oct 12, 2018 · Updated 7 years ago
Mihir3009 / In-BoXBART
In-BoXBART: Get Instructions into Biomedical Multi-task Learning
☆15 · Aug 23, 2022 · Updated 3 years ago
commoncrawl / cc-index-server
Common Crawl Index Server
☆71 · Feb 28, 2025 · Updated last year
minqi / wordcraft
An environment for benchmarking commonsense agents
☆30 · Aug 19, 2020 · Updated 5 years ago
dolgov / TT-IRT
Inverse Rosenblatt Transform (Conditional Distribution) + MCMC sampling using Tensor Train approximation
☆14 · Mar 17, 2025 · Updated last year
jungokasai / beam_with_patience
☆46 · Apr 13, 2022 · Updated 4 years ago
haydenhw / commoncrawl-emr-tutorial
☆12 · Mar 5, 2021 · Updated 5 years ago