kahnchana / mvu
🤖 [ICLR'25] Multimodal Video Understanding Framework (MVU)
☆37 · Updated 3 months ago
Alternatives and similar repositories for mvu:
Users interested in mvu are comparing it to the libraries listed below:
- Language Repository for Long Video Understanding ☆31 · Updated 10 months ago
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment ☆50 · Updated 4 months ago
- [ICLR 2025] CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion ☆43 · Updated 3 months ago
- Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model ☆59 · Updated 3 months ago
- [ECCV2024, Oral, Best Paper Finalist] This is the official implementation of the paper "LEGO: Learning EGOcentric Action Frame Generation … ☆37 · Updated 2 months ago
- [CVPR 2024] Data and benchmark code for the EgoExoLearn dataset ☆57 · Updated 8 months ago
- [ICLR 2025] Official implementation and benchmark evaluation repository of <PhysBench: Benchmarking and Enhancing Vision-Language Models … ☆56 · Updated last month
- Official code repo of PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs ☆26 · Updated 3 months ago
- Code for paper "VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos" ☆109 · Updated 2 months ago
- IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks ☆58 · Updated 7 months ago
- Egocentric Video Understanding Dataset (EVUD) ☆29 · Updated 10 months ago
- Code for the paper "Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers" ☆64 · Updated last month
- Code for "AVG-LLaVA: A Multimodal Large Model with Adaptive Visual Granularity" ☆28 · Updated 6 months ago
- Official implementation for "A Simple LLM Framework for Long-Range Video Question-Answering" ☆95 · Updated 6 months ago
- Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models [CVPR 2025] ☆61 · Updated 3 weeks ago
- [ECCV 2024] Learning Video Context as Interleaved Multimodal Sequences ☆38 · Updated last month
- Official Implementation of ISR-DPO: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective DPO (AAAI'25) ☆18 · Updated 2 months ago
- Implementation of the model "(MC-ViT)" from the paper "Memory Consolidation Enables Long-Context Video Understanding" ☆21 · Updated last month
- [arXiv: 2502.05178] QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation ☆69 · Updated 2 months ago
- ACL'24 (Oral) Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback ☆64 · Updated 7 months ago
- FreeVA: Offline MLLM as Training-Free Video Assistant ☆59 · Updated 10 months ago
- [ICML 2024] A Touch, Vision, and Language Dataset for Multimodal Alignment ☆72 · Updated 2 months ago
- This repo contains evaluation code for the paper "AV-Odyssey: Can Your Multimodal LLMs Really Understand Audio-Visual Information?" ☆24 · Updated 4 months ago
- [ICLR 2025] Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision ☆61 · Updated 9 months ago
- [CVPR 2025 Oral] VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection ☆75 · Updated 3 weeks ago