kohjingyu / fromage
🧀 Code and models for the ICML 2023 paper "Grounding Language Models to Images for Multimodal Inputs and Outputs".
⭐ 482 · Updated last year
Alternatives and similar repositories for fromage
Users interested in fromage are comparing it to the libraries listed below.
- Code and models for the NeurIPS 2023 paper "Generating Images with Multimodal Language Models". (⭐ 464, updated last year)
- DataComp: In search of the next generation of multimodal datasets (⭐ 743, updated 4 months ago)
- [NeurIPS 2023] Official implementation of the paper "An Inverse Scaling Law for CLIP Training" (⭐ 316, updated last year)
- Official repository for the LENS (Large Language Models Enhanced to See) system (⭐ 353, updated last month)
- Official repository of ChatCaptioner (⭐ 465, updated 2 years ago)
- Implementation of the DeepMind Flamingo vision-language model, based on Hugging Face language models and ready for training (⭐ 168, updated 2 years ago)
- GIT: A Generative Image-to-text Transformer for Vision and Language (⭐ 572, updated last year)
- (⭐ 228, updated last year)
- Implementation of 🦩 Flamingo, state-of-the-art few-shot visual question answering attention network out of DeepMind, in PyTorch (⭐ 1,261, updated 2 years ago)
- (⭐ 625, updated last year)
- Code release for "Learning Video Representations from Large Language Models" (⭐ 535, updated last year)
- An open-source implementation of "Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning", an all-new multimodal … (⭐ 363, updated last year)
- Language Models Can See: Plugging Visual Controls in Text Generation (⭐ 259, updated 3 years ago)
- Code used for the creation of OBELICS, an open, massive and curated collection of interleaved image-text web documents, containing 141M d… (⭐ 205, updated last year)
- CLIP-like model evaluation (⭐ 767, updated last month)
- MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities (ICML 2024) (⭐ 308, updated 7 months ago)
- MultimodalC4 is a multimodal extension of C4 that interleaves millions of images with text. (⭐ 938, updated 5 months ago)
- Conceptual 12M is a dataset containing (image-URL, caption) pairs collected for vision-and-language pre-training. (⭐ 401, updated 2 months ago)
- Open reproduction of MUSE for fast text2image generation (⭐ 356, updated last year)
- Research Trends in LLM-guided Multimodal Learning (⭐ 355, updated last year)
- [CVPR 2024] A benchmark for evaluating multimodal LLMs using multiple-choice questions (⭐ 348, updated 8 months ago)
- Code/data for the paper "LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding" (⭐ 269, updated last year)
- [NeurIPS 2023] Official implementation of "Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models" (⭐ 524, updated last year)
- Implementation of "Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic" (⭐ 278, updated 3 years ago)
- Large-scale text-video dataset of 10 million captioned short videos (⭐ 657, updated last year)
- Official implementation of SEED-LLaMA (ICLR 2024) (⭐ 621, updated 11 months ago)
- GRiT: A Generative Region-to-text Transformer for Object Understanding (ECCV 2024) (⭐ 335, updated last year)
- Easily create large video datasets from video URLs (⭐ 633, updated last year)
- Densely Captioned Images (DCI) dataset repository (⭐ 192, updated last year)
- Open LLaMA Eyes to See the World (⭐ 174, updated 2 years ago)