The code for "VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by VIdeo SpatioTemporal Augmentation" [CVPR2025]
☆21Feb 27, 2025Updated last year
Alternatives and similar repositories for VISTA
Users that are interested in VISTA are comparing it to the libraries listed below
Sorting:
- More reliable Video Understanding Evaluation☆14Sep 23, 2025Updated 5 months ago
- The official repo for "VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search" [EMNLP25]☆38Feb 1, 2026Updated last month
- Code for the paper "Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers" [ICCV 2025]☆101Jul 28, 2025Updated 7 months ago
- ☆37Sep 16, 2024Updated last year
- VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning☆35Jul 15, 2025Updated 7 months ago
- VideoNIAH: A Flexible Synthetic Method for Benchmarking Video MLLMs☆54Mar 9, 2025Updated 11 months ago
- DreamDance: Personalized Text-to-video Generation by Combining Text-to-Image Synthesis and Motion Transfer☆14Dec 16, 2022Updated 3 years ago
- LoPA: Scaling dLLM Inference via Lookahead Parallel Decoding☆34Jan 16, 2026Updated last month
- [ECCV'24 Oral] PiTe: Pixel-Temporal Alignment for Large Video-Language Model☆17Feb 13, 2025Updated last year
- ☆37Nov 8, 2024Updated last year
- [EMNLP 2025 Main] Official implementation of VRoPE: Rotary Position Embedding for Video Large Language Models.☆27Nov 18, 2025Updated 3 months ago
- TEMPURA enables video-language models to reason about causal event relationships and generate fine-grained, timestamped descriptions of u…☆25Jun 4, 2025Updated 8 months ago
- ABC: Achieving Better Control of Multimodal Embeddings using VLMs [TMLR2025]☆20Aug 21, 2025Updated 6 months ago
- ☆16Jul 23, 2024Updated last year
- Unifying Specialized Visual Encoders for Video Language Models☆25Nov 22, 2025Updated 3 months ago
- [CVPR2025] VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding☆24Mar 24, 2025Updated 11 months ago
- Test-time Scaling for VAR models☆31Sep 19, 2025Updated 5 months ago
- ☆21Apr 17, 2025Updated 10 months ago
- ☆59Mar 3, 2025Updated 11 months ago
- An object detection codebase based on MegEngine.☆28Dec 14, 2022Updated 3 years ago
- ☆21Oct 31, 2024Updated last year
- Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?☆87Jul 13, 2025Updated 7 months ago
- For Ego4D VQ3D Task☆22Jan 9, 2024Updated 2 years ago
- [NeurIPS 2025] ScaleKV: Memory-Efficient Visual Autoregressive Modeling with Scale-Aware KV Cache Compression☆50Nov 4, 2025Updated 3 months ago
- The first spoken long-text dataset derived from live streams, designed to reflect the redundancy-rich and conversational nature of real-w…☆12Jun 28, 2025Updated 8 months ago
- A Text2SQL benchmark for evaluation of Large Language Models☆41Updated this week
- FreeVA: Offline MLLM as Training-Free Video Assistant☆69Jun 9, 2024Updated last year
- [CVPR 2025 AI4CC Workshop] Official Implementation of HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editin…☆35May 8, 2025Updated 9 months ago
- Accelerating Vision-Language Pretraining with Free Language Modeling (CVPR 2023)☆32May 15, 2023Updated 2 years ago
- Open-Pandora: On-the-fly Control Video Generation☆35Nov 28, 2024Updated last year
- ☆18Jun 10, 2025Updated 8 months ago
- ☆36Feb 12, 2025Updated last year
- [NeurIPS ENLSP Workshop'24] CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios☆16Oct 18, 2024Updated last year
- Official PyTorch Implementation for Vision-Language Models Create Cross-Modal Task Representations, ICML 2025☆33May 1, 2025Updated 9 months ago
- Long Context Transfer from Language to Vision☆402Mar 18, 2025Updated 11 months ago
- [NeurIPS 2024] Official code for HourVideo: 1-Hour Video Language Understanding☆160Jul 12, 2025Updated 7 months ago
- [AAAI 2026] Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models☆38Jan 27, 2026Updated last month
- Official repo for "AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability"☆34Jul 12, 2024Updated last year
- [ICML 2025] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models☆39Jun 14, 2025Updated 8 months ago