princetonvisualai / merv
Unifying Specialized Visual Encoders for Video Language Models
☆13 Updated this week
Alternatives and similar repositories for merv:
Users that are interested in merv are comparing it to the libraries listed below
- [ECCV2024] Learning Video Context as Interleaved Multimodal Sequences ☆32 Updated 3 months ago
- IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks ☆59 Updated 3 months ago
- (NeurIPS 2024 Spotlight) TOPA: Extend Large Language Models for Video Understanding via Text-Only Pre-Alignment ☆26 Updated 3 months ago
- 🔥 [CVPR 2024] Official implementation of "See, Say, and Segment: Teaching LMMs to Overcome False Premises (SESAME)" ☆30 Updated 6 months ago
- TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models ☆26 Updated 2 months ago
- [ECCV 2024] ControlCap: Controllable Region-level Captioning ☆59 Updated 2 months ago
- [NeurIPS-24] This is the official implementation of the paper "DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effect… ☆35 Updated 6 months ago
- [ECCV2024, Oral, Best Paper Finalist] This is the official implementation of the paper "LEGO: Learning EGOcentric Action Frame Generation … ☆34 Updated 2 months ago
- Diffusion Powers Video Tokenizer for Comprehension and Generation ☆38 Updated last month
- Visual Programming for Text-to-Image Generation and Evaluation (NeurIPS 2023) ☆54 Updated last year
- Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models ☆39 Updated this week
- ☆42 Updated last week
- ☆57 Updated last year
- ☆26 Updated 5 months ago
- Official Repository of Personalized Visual Instruct Tuning ☆26 Updated 2 months ago
- Egocentric Video Understanding Dataset (EVUD) ☆24 Updated 6 months ago
- Official Repository of VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges ☆59 Updated 3 months ago
- Official implementation of the paper "Boosting Human-Object Interaction Detection with Text-to-Image Diffusion Model" ☆55 Updated last year
- Language Repository for Long Video Understanding ☆31 Updated 6 months ago
- [NeurIPS 2024] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos ☆94 Updated 2 weeks ago
- This repo contains evaluation code for the paper "AV-Odyssey: Can Your Multimodal LLMs Really Understand Audio-Visual Information?" ☆19 Updated 2 weeks ago
- [NeurIPS 2024] EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models. ☆44 Updated 2 months ago
- [CVPR 2024] Data and benchmark code for the EgoExoLearn dataset ☆49 Updated 4 months ago
- ☆12 Updated 2 months ago
- Official Implementation of ICLR'24: Kosmos-G: Generating Images in Context with Multimodal Large Language Models ☆57 Updated 7 months ago
- Can 3D Vision-Language Models Truly Understand Natural Language? ☆21 Updated 9 months ago
- Official repo of the paper "MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos" ☆23 Updated 3 months ago
- [NeurIPS 2024] Official PyTorch implementation of "Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives" ☆32 Updated last month
- Code for paper "Super-CLEVR: A Virtual Benchmark to Diagnose Domain Robustness in Visual Reasoning" ☆23 Updated last year
- [NeurIPS 2024 D&B Track] Official Repo for "LVD-2M: A Long-take Video Dataset with Temporally Dense Captions" ☆45 Updated 2 months ago