kahnchana / mvu

🤖 [ICLR'25] Multimodal Video Understanding Framework (MVU)

☆49

Alternatives and similar repositories for mvu

Users who are interested in mvu are comparing it to the libraries listed below.

kkahatapitiya / LangRepo
Language Repository for Long Video Understanding
☆32 · Updated last year
orrzohar / Video-STaR
[ICLR 2025] Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision
☆70 · Updated last year
CeeZh / LLoVi
Official implementation for "A Simple LLM Framework for Long-Range Video Question-Answering"
☆101 · Updated 11 months ago
yonseivnl / vlm-rlaif
ACL'24 (Oral) Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback
☆75 · Updated last year
OpenGVLab / TPO
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
☆60 · Updated 3 months ago
alanaai / EVUD
Egocentric Video Understanding Dataset (EVUD)
☆31 · Updated last year
yihedeng9 / OpenVLThinker
OpenVLThinker: An Early Exploration to Vision-Language Reasoning via Iterative Self-Improvement
☆113 · Updated 3 months ago
Yui010206 / CREMA
[ICLR 2025] CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
☆53 · Updated 3 months ago
wxh1996 / VideoAgent
☆117 · Updated 6 months ago
eric-ai-lab / MMWorld
Official repo of the ICLR 2025 paper "MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos"
☆29 · Updated 3 months ago
UX-Decoder / FIND
[NeurIPS 2024] Official implementation of the paper "Interfacing Foundation Models' Embeddings"
☆125 · Updated last year
jongwoopark7978 / LVNet
☆36 · Updated 7 months ago
NVlabs / LITA
☆186 · Updated last year
LilyDaytoy / OpenPVSG
Benchmarking Panoptic Video Scene Graph Generation (PVSG), CVPR'23
☆98 · Updated last year
Ahnsun / merlin
[ECCV 2024] Official code implementation of Merlin: Empowering Multimodal LLMs with Foresight Minds
☆94 · Updated last year
TencentARC / SEED-Bench-R1
☆90 · Updated 4 months ago
HaroldChen19 / VistaDPO
[ICML 2025] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models
☆34 · Updated 4 months ago
cliangyu / Cola
[NeurIPS 2023] Official implementation of the paper "Large Language Models are Visual Reasoning Coordinators"
☆103 · Updated last year
Gabesarch / grounded-rl
☆97 · Updated 3 months ago
UMass-Embodied-AGI / CoVLM
[ICLR 2023] CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding
☆45 · Updated 4 months ago
EvolvingLMMs-Lab / MGPO
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning
☆49 · Updated 3 months ago
BolinLai / LEGO
[ECCV 2024, Oral, Best Paper Finalist] This is the official implementation of the paper "LEGO: Learning EGOcentric Action Frame Generation…
☆38 · Updated 8 months ago
ziplab / LongVLM
☆104 · Updated last year
amazon-science / QA-ViT
☆69 · Updated last year
imagegridworth / IG-VLM
☆138 · Updated last year
showlab / VideoGUI
[NeurIPS 2024 D&B] VideoGUI: A Benchmark for GUI Automation from Instructional Videos
☆45 · Updated 4 months ago
bigai-nlco / VideoLLaMB
[ICCV 2025] Official Repository of VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges
☆77 · Updated 7 months ago
jh-yi / Video-Panda
Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models [CVPR 2025]
☆73 · Updated 4 months ago
tulip-berkeley / open_clip
An open source implementation of CLIP (With TULIP Support)
☆162 · Updated 5 months ago
WHB139426 / Grounded-Video-LLM
[EMNLP 2025 Findings] Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models
☆130 · Updated 2 months ago