facebookresearch / selective-vqa_ood
Implementation for the CVPR 2023 paper "Improving Selective Visual Question Answering by Learning from Your Peers" (https://arxiv.org/abs/2306.08751)
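The paper's setting, selective VQA, lets a model abstain from answering when it is unsure, and evaluates it by the trade-off between coverage (how often it answers) and selective risk (how often the given answers are wrong). A minimal sketch of that evaluation, assuming a simple confidence threshold; the function name and inputs are illustrative, not the repo's API:

```python
def risk_coverage(confidences, correct, threshold):
    """Selective-prediction metrics at a given confidence threshold.

    coverage: fraction of questions the model chooses to answer
    selective risk: error rate among the answered questions
    """
    # Keep the correctness flags of only the questions the model answers.
    answered = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    if not answered:
        return 0.0, 0.0  # abstains on everything: no risk incurred
    coverage = len(answered) / len(confidences)
    risk = sum(1 for ok in answered if not ok) / len(answered)
    return coverage, risk

# Toy example: 4 questions; the model is unsure about (and wrong on) the third.
conf = [0.9, 0.8, 0.4, 0.95]
corr = [True, True, False, True]
print(risk_coverage(conf, corr, 0.5))  # (0.75, 0.0): answers 3 of 4, all correct
```

Sweeping the threshold from 0 to 1 traces a risk-coverage curve; a better selective model keeps risk low at higher coverage.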
Related projects:
- [ICML 2024] Official implementation of the paper "Rejuvenating image-GPT as Strong Visual Representation Lea…"
- Matryoshka Multimodal Models
- How Good is Google Bard's Visual Understanding? An Empirical Study on Open Challenges
- [ACL 2023] MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning
- Official PyTorch implementation of Self-emerging Token Labeling
- Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models
- Official code for "What Makes for Good Visual Tokenizers for Large Language Models?"
- [CVPR 2023] HierVL: Learning Hierarchical Video-Language Embeddings
- VideoHallucer: the first comprehensive benchmark for hallucination detection in large video-language models (LVLMs)
- Democratization of "PaLI: A Jointly-Scaled Multilingual Language-Image Model"
- [CBMI 2024] Official repository of the paper "Is CLIP the main roadblock for fine-grained open-world perception?"
- MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria
- Code for the experiments in "ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy"
- Official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models"
- [ACL 2024, Oral] Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback
- Code and models for "COSA: Concatenated Sample Pretrained Vision-Language Foundation Model"
- Official repository of the paper "Subobject-level Image Tokenization"
- Official repository of VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges
- Implementation of MC-ViT from the paper "Memory Consolidation Enables Long-Context Video Understanding"
- Official implementation of the paper "Finetuned Multimodal Language Models are High-Quality Image-Text Data Filters"