ilkerkesen / ViLMA
ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models (ICLR 2024, Official Implementation)
☆14Updated 10 months ago
Related projects ⓘ
Alternatives and complementary repositories for ViLMA
- Code and datasets for "Text encoders are performance bottlenecks in contrastive vision-language models". Coming soon!☆11Updated last year
- [CVPR23 Highlight] CREPE: Can Vision-Language Foundation Models Reason Compositionally?☆32Updated last year
- Code and datasets for "What’s “up” with vision-language models? Investigating their struggle with spatial reasoning".☆34Updated 8 months ago
- [ICCV23] Official implementation of eP-ALM: Efficient Perceptual Augmentation of Language Models.☆27Updated last year
- Language Repository for Long Video Understanding☆29Updated 5 months ago
- Official implementation of "Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data" (ICLR 2024)☆28Updated last month
- If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions☆13Updated 7 months ago
- Code for "Are “Hierarchical” Visual Representations Hierarchical?" in NeurIPS Workshop for Symmetry and Geometry in Neural Representation…☆19Updated last year
- https://arxiv.org/abs/2209.15162☆48Updated last year
- Holistic evaluation of multimodal foundation models☆41Updated 3 months ago
- Official This-Is-My Dataset published in CVPR 2023☆15Updated 4 months ago
- VPEval Codebase from Visual Programming for Text-to-Image Generation and Evaluation (NeurIPS 2023)☆43Updated 11 months ago
- ☆29Updated 2 years ago
- ☕️ CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion☆27Updated 5 months ago
- Official code for our CVPR 2023 paper: Test of Time: Instilling Video-Language Models with a Sense of Time☆45Updated 5 months ago
- ☆19Updated last month
- Multimodal Video Understanding Framework (MVU)☆24Updated 6 months ago
- Do Vision and Language Models Share Concepts? A Vector Space Alignment Study☆12Updated this week
- ☆29Updated last year
- COLA: Evaluate how well your vision-language model can Compose Objects Localized with Attributes!☆22Updated this week
- [ICLR 23] Contrastive Aligned of Vision to Language Through Parameter-Efficient Transfer Learning☆37Updated last year
- Video descriptions of research papers relating to foundation models and scaling☆30Updated last year
- Code for paper: VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning☆29Updated 7 months ago
- ☆30Updated 9 months ago
- Source code for paper "A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image …☆55Updated last month
- [CVPR 2023] HierVL Learning Hierarchical Video-Language Embeddings☆44Updated last year
- Implementation of Bitune: Bidirectional Instruction-Tuning☆15Updated 5 months ago
- ☆33Updated 4 months ago
- (WACV 2025) Vision-language conversation in 10 languages including English, Chinese, French, Spanish, Russian, Japanese, Arabic, Hindi, B…☆81Updated 2 months ago
- Repo for paper: "Paxion: Patching Action Knowledge in Video-Language Foundation Models" Neurips 23 Spotlight☆35Updated last year