rom1504/cc2dataset

Easily convert common crawl to a dataset of caption and document. Image/text Audio/text Video/text, ...

☆321

Alternatives and similar repositories for cc2dataset

Users that are interested in cc2dataset are comparing it to the libraries listed below.

LAION-AI / laion50BU
Un-*** 50 billions multimodality dataset
☆24 · Sep 14, 2022 · Updated 3 years ago
mlfoundations / datacomp
DataComp: In search of the next generation of multimodal datasets
☆787 · Apr 28, 2025 · Updated last year
LAION-AI / Big-Interleaved-Dataset
Big-Interleaved-Dataset
☆59 · Jan 21, 2023 · Updated 3 years ago
rom1504 / img2dataset
Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
☆4,435 · Oct 19, 2025 · Updated 9 months ago
allenai / mmc4
MultimodalC4 is a multimodal extension of c4 that interleaves millions of images with text.
☆953 · Mar 19, 2025 · Updated last year
kakaobrain / coyo-dataset
COYO-700M: Large-scale Image-Text Pair Dataset
☆1,256 · Nov 30, 2022 · Updated 3 years ago
afiaka87 / laionide
checkpoints for glide finetuned on laion and other datasets. wip.
☆50 · Aug 17, 2022 · Updated 3 years ago
facebookresearch / cc_net
Tools to download and cleanup Common Crawl data
☆1,046 · Apr 25, 2023 · Updated 3 years ago
lucidrains / nim-tokenizer
Implementation of a simple BPE tokenizer, but in Nim
☆22 · Jul 2, 2023 · Updated 3 years ago
facebookresearch / dmae_st
Directed masked autoencoders
☆14 · Mar 25, 2026 · Updated 3 months ago
rom1504 / laion-prepro
Get hundreds of millions of image+url pairs from the crawling at home dataset and preprocess them
☆222 · May 26, 2024 · Updated 2 years ago
kakaobrain / karlo
☆699 · Mar 6, 2023 · Updated 3 years ago
iejMac / video2dataset
Easily create large video dataset from video urls
☆660 · Jul 30, 2024 · Updated last year
shayne-longpre / a-pretrainers-guide
☆71 · May 22, 2023 · Updated 3 years ago
LooperXX / ManagerTower
Code for ACL 2023 Oral Paper: ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning
☆12 · Aug 23, 2025 · Updated 10 months ago
kingoflolz / CLIP_JAX
Contrastive Language-Image Pretraining
☆147 · Sep 6, 2022 · Updated 3 years ago
lucidrains / x-clip
A concise but complete implementation of CLIP with various experimental improvements from recent papers
☆724 · Oct 16, 2023 · Updated 2 years ago
facebookresearch / CiT
Code for the paper titled "CiT Curation in Training for Effective Vision-Language Data".
☆78 · Jan 18, 2023 · Updated 3 years ago
commoncrawl / ia-web-commons
Web archiving utility library
☆11 · Jun 19, 2026 · Updated last month
criteo / autofaiss
Automatically create Faiss knn indices with the most optimal similarity search parameters.
☆906 · Nov 4, 2025 · Updated 8 months ago
pbaylies / Augmented_CLIP
Training simple models to predict CLIP image embeddings from text embeddings, and vice versa.
☆60 · Mar 31, 2022 · Updated 4 years ago
seonghyeonye / Flipped-Learning
[ICLR 2023] Guess the Instruction! Flipped Learning Makes Language Models Stronger Zero-Shot Learners
☆117 · Jun 28, 2025 · Updated last year
huggingface / OBELICS
Code used for the creation of OBELICS, an open, massive and curated collection of interleaved image-text web documents, containing 141M d…
☆215 · Aug 28, 2024 · Updated last year
google-research-datasets / wit
WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique imag…
☆1,113 · Sep 27, 2024 · Updated last year
ClashLuke / tpucare
Automatically take good care of your preemptible TPUs
☆37 · May 15, 2023 · Updated 3 years ago
antofuller / configaformers
A python library for highly configurable transformers - easing model architecture search and experimentation.
☆48 · Nov 30, 2021 · Updated 4 years ago
LAION-AI / LAION-PEOPLE
This project provides a data set with bounding boxes, body poses, 3D face meshes & captions of people from our LAION-2.2B. Additionally i…
☆14 · Jan 2, 2022 · Updated 4 years ago
iejMac / video2numpy
Optimized library for large-scale extraction of frames and audio from video.
☆203 · Sep 11, 2023 · Updated 2 years ago
facebookresearch / distributed-faiss
A library for building and serving multi-node distributed faiss indices.
☆280 · Nov 1, 2023 · Updated 2 years ago
TheoCoombes / crawlingathome
A client library for LAION's effort to filter CommonCrawl with CLIP, building a large scale image-text dataset.
☆33 · Mar 21, 2023 · Updated 3 years ago
webdataset / webdataset
A high-performance Python-based I/O system for large (and small) deep learning problems, with strong support for PyTorch.
☆3,147 · Feb 9, 2026 · Updated 5 months ago
crowsonkb / cloob-training
CLOOB training (JAX) and inference (JAX and PyTorch)
☆76 · May 16, 2022 · Updated 4 years ago
LAION-AI / CLIP_benchmark
CLIP-like model evaluation
☆814 · Mar 19, 2026 · Updated 4 months ago
LAION-AI / Open-Instruction-Generalist
Open Instruction Generalist is an assistant trained on massive synthetic instructions to perform many millions of tasks
☆210 · Jan 13, 2024 · Updated 2 years ago
rom1504 / clip-retrieval
Easily compute clip embeddings and build a clip retrieval system with them
☆2,786 · Mar 28, 2026 · Updated 3 months ago
rom1504 / any2dataset
Turn any collection of files into a dataset
☆45 · Mar 10, 2023 · Updated 3 years ago
jasonppy / syllable-discovery
Syllable Segmentation and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model
☆35 · Aug 27, 2023 · Updated 2 years ago
borisdayma / clip-jax
Train vision models using JAX and 🤗 transformers
☆103 · Dec 14, 2025 · Updated 7 months ago
huggingface / datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
☆3,217 · Updated this week