[NeurIPS 2024] Official code for HourVideo: 1-Hour Video Language Understanding
☆160 · Updated Jul 12, 2025
Alternatives and similar repositories for HourVideo
Users interested in HourVideo are comparing it to the repositories listed below.
- Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning · ☆141 · Updated Aug 21, 2025
- The code for "VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by VIdeo SpatioTemporal Augmentation" [CVPR 2025] · ☆21 · Updated Feb 27, 2025
- MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment · ☆35 · Updated Jul 1, 2024
- ☆16 · Updated Sep 25, 2025
- Video-R1: Reinforcing Video Reasoning in MLLMs [🔥the first paper to explore R1 for video] · ☆831 · Updated Dec 14, 2025
- [ICLR 2026] VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling · ☆511 · Updated Nov 18, 2025
- Official repo and evaluation implementation of VSI-Bench · ☆675 · Updated Aug 5, 2025
- Official implementation of Next Block Prediction: Video Generation via Semi-Autoregressive Modeling · ☆41 · Updated Feb 12, 2025
- Preference Learning for LLaVA · ☆59 · Updated Nov 9, 2024
- [SCIS 2024] The official implementation of the paper "MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Di…" · ☆62 · Updated Nov 7, 2024
- Implementation of "VL-Mamba: Exploring State Space Models for Multimodal Learning" · ☆86 · Updated Mar 21, 2024
- This repository contains the implementation of the paper "ChatCam: Empowering Camera Control through Conversational AI" (NeurIPS 2024) · ☆21 · Updated Nov 15, 2024
- ☆128 · Updated Nov 1, 2025
- ☆31 · Updated Sep 19, 2025
- [CVPR 2025] Code release of F-LMM: Grounding Frozen Large Multimodal Models · ☆108 · Updated May 29, 2025
- TStar: a unified temporal search framework for long-form video question answering · ☆88 · Updated Sep 2, 2025
- WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation · ☆186 · Updated Nov 6, 2025
- Visual Spatial Tuning · ☆176 · Updated Feb 19, 2026
- LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (ECCV 2024) · ☆859 · Updated Jul 29, 2024
- ✨First Open-Source R1-like Video-LLM [2025/02/18] · ☆381 · Updated Feb 23, 2025
- [ACL 2025 🔥] Rethinking Step-by-step Visual Reasoning in LLMs · ☆310 · Updated May 21, 2025
- (ACL 2025) MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale · ☆49 · Updated Jun 4, 2025
- Some LaTeX Tips for Writing Research Papers · ☆10 · Updated May 30, 2016
- [ICML '25] Official code for the paper "Occult: Optimizing Collaborative Communication across Experts for Accelerated Parallel MoE Training an…" · ☆13 · Updated Apr 17, 2025
- ☆13 · Updated Apr 23, 2025
- Code for MetaMorph: Multimodal Understanding and Generation via Instruction Tuning · ☆234 · Updated Jan 22, 2026
- ☆11 · Updated Jan 8, 2025
- ☆12 · Updated Sep 23, 2024
- [CVPR 2025] Program synthesis for 3D spatial reasoning · ☆58 · Updated Jun 16, 2025
- Code for [RA-L] High-Fidelity Simulated Data Generation for Real-World Zero-Shot Robotic Manipulation Learning with Gaussian Splatting · ☆51 · Updated Feb 25, 2026
- Accelerating Vision-Language Pretraining with Free Language Modeling (CVPR 2023) · ☆32 · Updated May 15, 2023
- [NeurIPS '25] Evaluation code for the paper "KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models" · ☆40 · Updated Oct 19, 2025
- The repo of the paper "Generalist Vision Foundation Models for Medical Imaging: A Case Study of Segment Anything Model on Zero-Shot Medic…" · ☆11 · Updated May 26, 2023
- [ECCV 2022] Official implementation of MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes · ☆16 · Updated Feb 2, 2023
- The codebase for the EMNLP 2024 paper "Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Mo…" · ☆86 · Updated Jan 27, 2025
- SpaceVLLM: Endowing Multimodal Large Language Models with Spatio-Temporal Video Grounding Capability · ☆16 · Updated May 8, 2025
- SODA: Story Oriented Dense Video Captioning Evaluation Framework · ☆13 · Updated May 3, 2024
- A Massive Multi-Discipline Lecture Understanding Benchmark · ☆33 · Updated Nov 1, 2025
- The official implementation of "Grounded Chain-of-Thought for Multimodal Large Language Models" · ☆21 · Updated Jul 21, 2025