rom1504 / cc2dataset
Easily convert Common Crawl to a dataset of captions and documents: image/text, audio/text, video/text, ...
☆310 Updated 11 months ago
Related projects
Alternatives and complementary repositories for cc2dataset
- Code used for the creation of OBELICS, an open, massive and curated collection of interleaved image-text web documents, containing 141M d… ☆189 Updated 2 months ago
- Used for adaptive human-in-the-loop evaluation of language and embedding models. ☆304 Updated last year
- DataComp: In search of the next generation of multimodal datasets ☆657 Updated 10 months ago
- Experiments around a simple idea for inducing multiple hierarchical predictive models within a GPT ☆205 Updated 3 months ago
- Implementation of the conditionally routed attention in the CoLT5 architecture, in PyTorch ☆225 Updated 2 months ago
- Get hundreds of millions of image+url pairs from the crawling at home dataset and preprocess them ☆206 Updated 5 months ago
- Scaling Data-Constrained Language Models ☆321 Updated last month
- Multipack distributed sampler for fast padding-free training of LLMs ☆178 Updated 3 months ago
- Internet Explorer explores the web in a self-supervised manner to progressively find relevant examples that improve performance on a desi… ☆163 Updated last year
- Efficiently read embeddings in a streaming fashion from any filesystem ☆96 Updated 6 months ago
- ☆292 Updated 4 months ago
- Exploring finetuning public checkpoints on filtered 8K sequences on the Pile ☆115 Updated last year
- Pipeline for pulling and processing online language model pretraining data from the web ☆174 Updated last year
- Aim for the moon. If you miss, you may hit a star. ☆160 Updated last year
- Language Modeling with the H3 State Space Model ☆513 Updated last year
- Open reproduction of MUSE for fast text2image generation. ☆332 Updated 5 months ago
- JAX implementation of the Llama 2 model ☆210 Updated 9 months ago
- ☆100 Updated 9 months ago
- Implementation of Recurrent Memory Transformer, NeurIPS 2022 paper, in PyTorch ☆394 Updated this week
- Implementation of the DeepMind Flamingo vision-language model, based on Hugging Face language models and ready for training ☆164 Updated last year
- Experiments with generating open-source language model assistants ☆97 Updated last year
- 🧀 Code and models for the ICML 2023 paper "Grounding Language Models to Images for Multimodal Inputs and Outputs". ☆478 Updated last year
- Easily compute CLIP embeddings from video frames ☆136 Updated last year
- ☆334 Updated 7 months ago
- Simple large-scale training of Stable Diffusion with multi-node support. ☆126 Updated last year
- Git extension for {collaborative, communal, continual} model development ☆205 Updated this week
- Open Instruction Generalist is an assistant trained on massive synthetic instructions to perform many millions of tasks ☆206 Updated 10 months ago
- A repository for research on medium-sized language models. ☆479 Updated this week
- Easily create large video datasets from video URLs ☆546 Updated 3 months ago