HaroldChen19 / VistaDPOLinks
[ICML 2025] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models
☆35Updated 3 months ago
Alternatives and similar repositories for VistaDPO
Users that are interested in VistaDPO are comparing it to the libraries listed below
Sorting:
- ☆90Updated 3 months ago
- Repo for paper "T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs"☆49Updated last month
- Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?☆73Updated 2 months ago
- Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning☆122Updated last month
- Uni-CoT: Towards Unified Chain-of-Thought Reasoning Across Text and Vision☆150Updated 2 weeks ago
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment☆60Updated 2 months ago
- [ICLR 2025] AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark☆128Updated 4 months ago
- [ICLR2025] MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models☆86Updated last year
- PhysGame Benchmark for Physical Commonsense Evaluation in Gameplay Videos☆45Updated 3 months ago
- Official Implementation of OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation☆32Updated 3 months ago
- ☆75Updated 3 months ago
- [arXiv: 2502.05178] QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation☆87Updated 7 months ago
- Official implement of MIA-DPO☆66Updated 8 months ago
- Quick Long Video Understanding☆64Updated 3 months ago
- Official repository of 'ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing’☆57Updated 3 months ago
- [ICLR 2025] Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision☆70Updated last year
- Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model☆74Updated 8 months ago
- ☆60Updated last month
- Structured Video Comprehension of Real-World Shorts☆204Updated 2 weeks ago
- [NeurIPS 2024] Efficient Large Multi-modal Models via Visual Context Compression☆61Updated 7 months ago
- [EMNLP 2023] TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding☆50Updated last year
- [ICCV 2025] Official Repository of VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges☆77Updated 7 months ago
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model☆42Updated last year
- [EMNLP-2025 Oral] ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration☆57Updated last month
- Official Repository of Personalized Visual Instruct Tuning☆32Updated 7 months ago
- 🔥🔥🔥 Latest Papers, Codes and Datasets on Video-LMM Post-Training☆52Updated this week
- Implementation for "The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer"☆65Updated last month
- [ECCV 2024] Learning Video Context as Interleaved Multimodal Sequences☆40Updated 7 months ago
- (ICCV2025) Official repository of paper "ViSpeak: Visual Instruction Feedback in Streaming Videos"☆40Updated 3 months ago
- ICML 2025 - Impossible Videos☆74Updated 2 months ago