facebookresearch / MetaCLIP
ICLR 2024 Spotlight: curation/training code, metadata, distribution, and pre-trained models for MetaCLIP; CVPR 2024: MoDE: CLIP Data Experts via Clustering
☆1,169 · updated 2 weeks ago
Related projects:
- DataComp: In search of the next generation of multimodal datasets (☆636, updated 8 months ago)
- Emu Series: Generative Multimodal Models from BAAI (☆1,604, updated 6 months ago)
- Hiera: A fast, powerful, and simple hierarchical vision transformer (☆857, updated 6 months ago)
- [CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks (☆740, updated 3 months ago)
- [CVPR 2023] Official implementation of X-Decoder for generalized decoding for pixel, image, and language (☆1,281, updated 11 months ago)
- Implementation of 🦩 Flamingo, state-of-the-art few-shot visual question answering attention net out of DeepMind, in PyTorch (☆1,192, updated last year)
- CLIP-like model evaluation (☆584, updated last month)
- PyTorch implementation of "V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs" (☆502, updated 8 months ago)
- Official implementation of the research paper "MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training" (☆560, updated last month)
- Code and model checkpoints for the research paper "Scalable Pre-training of Large Autoregressive Image Models" (☆680, updated 4 months ago)
- VisionLLM Series (☆846, updated this week)
- Official codebase used to develop Vision Transformer, SigLIP, MLP-Mixer, LiT, and more (☆2,206, updated 3 weeks ago)
- LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills (☆692, updated 7 months ago)
- A method to increase the speed and lower the memory footprint of existing vision transformers (☆934, updated 3 months ago)
- Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation (☆1,197, updated last month)
- MultimodalC4 is a multimodal extension of C4 that interleaves millions of images with text (☆895, updated 3 months ago)
- A general representation model across vision, audio, and language modalities. Paper: "ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities" (☆933, updated 2 months ago)
- [ECCV 2024] Video foundation models & data for multimodal understanding (☆1,300, updated 3 weeks ago)
- Robust fine-tuning of zero-shot models (☆629, updated 2 years ago)
- 🐟 Code and models for the NeurIPS 2023 paper "Generating Images with Multimodal Language Models" (☆415, updated 8 months ago)
- Easily compute CLIP embeddings and build a CLIP retrieval system with them (☆2,355, updated 5 months ago)
- A collection of papers on the topic of "Computer Vision in the Wild (CVinW)" (☆1,146, updated 6 months ago)
- Official repository for the LENS (Large Language Models Enhanced to See) system (☆345, updated 9 months ago)
- Official JAX implementation of MAGVIT: Masked Generative Video Transformer (☆938, updated 8 months ago)
- LLaVA-Interactive-Demo (☆344, updated last month)
- TorchMultimodal is a PyTorch library for training state-of-the-art multimodal multi-task models at scale (☆1,423, updated this week)
- A family of lightweight multimodal models (☆877, updated 2 weeks ago)
- Official code for VisProg (CVPR 2023 Best Paper!) (☆683, updated 3 weeks ago)