HaroldChen19 / VistaDPO
[ICML 2025] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models
☆27Updated last week
Alternatives and similar repositories for VistaDPO
Users who are interested in VistaDPO are comparing it to the repositories listed below
- ☆84Updated 2 months ago
- Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?☆51Updated 3 weeks ago
- Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning☆70Updated last week
- [CVPR 25] A framework named B^2-DiffuRL for RL-based diffusion model fine-tuning.☆30Updated 2 months ago
- [NeurIPS 2024] The official implementation of the research paper "FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Atten…☆45Updated 4 months ago
- ☆49Updated 2 months ago
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment☆51Updated 5 months ago
- ☆37Updated last month
- The code repository of UniRL☆30Updated 3 weeks ago
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model☆42Updated 10 months ago
- ☆21Updated 2 months ago
- Official Repository of Personalized Visual Instruct Tuning☆29Updated 3 months ago
- Official repository of "CoMP: Continual Multimodal Pre-training for Vision Foundation Models"☆27Updated 2 months ago
- Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better☆29Updated last week
- Official implementation of Next Block Prediction: Video Generation via Semi-Autoregressive Modeling☆35Updated 4 months ago
- [NeurIPS 2024] EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models.☆49Updated 8 months ago
- [CVPR2025] A benchmark for evaluating video generative models in generating short stories☆15Updated last month
- Quick Long Video Understanding☆55Updated last week
- On Path to Multimodal Generalist: General-Level and General-Bench☆14Updated last month
- [arXiv: 2502.05178] QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation☆75Updated 3 months ago
- Fast-Slow Thinking for Large Vision-Language Model Reasoning☆15Updated last month
- ☆51Updated 2 months ago
- Official InfiniBench: A Benchmark for Large Multi-Modal Models in Long-Form Movies and TV Shows☆14Updated 2 weeks ago
- Code for "VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement"☆47Updated 6 months ago
- ☆23Updated last year
- ☆17Updated last week
- [CVPR 2025] OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?☆65Updated 2 months ago
- VisRL: Intention-Driven Visual Perception via Reinforced Reasoning☆29Updated last week
- [CVPR 2025] PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models☆41Updated 2 weeks ago
- [NeurIPS 2024 D&B Track] Official Repo for "LVD-2M: A Long-take Video Dataset with Temporally Dense Captions"☆62Updated 8 months ago