Mark12Ding / Dispider
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
☆79 Updated last month
Alternatives and similar repositories for Dispider:
Users who are interested in Dispider are comparing it to the libraries listed below.
- Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model ☆41 Updated last month
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment ☆40 Updated last month
- [NeurIPS 2024] Official PyTorch Implementation of "FlowTurbo: Towards Real-time Flow-Based Image Generation with Velocity Refiner" ☆64 Updated 4 months ago
- Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models ☆83 Updated 2 months ago
- ☆159 Updated 4 months ago
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model ☆41 Updated 6 months ago
- Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models ☆128 Updated 2 months ago
- Code for paper "VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos" ☆93 Updated 6 months ago
- ☆66 Updated 2 months ago
- [ICLR 2025] AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark ☆75 Updated 3 weeks ago
- [NeurIPS 2024] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos ☆103 Updated last month
- Official GPU implementation of the paper "PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance" ☆124 Updated 3 months ago
- OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? ☆27 Updated 3 weeks ago
- ✨✨Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models ☆150 Updated last month
- [NeurIPS2024] VideoGUI: A Benchmark for GUI Automation from Instructional Videos ☆29 Updated 2 months ago
- Official Repository of VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges ☆61 Updated 5 months ago
- [ICLR 2025] Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision ☆59 Updated 7 months ago
- This is the official implementation of "Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams" ☆159 Updated last month
- 🤖 [ICLR'25] Multimodal Video Understanding Framework (MVU) ☆27 Updated 2 weeks ago
- This is the official repo for ByteVideoLLM/Dynamic-VLM ☆19 Updated 2 months ago
- OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation, arXiv 2024 ☆49 Updated 3 weeks ago
- Official implementation of MIA-DPO ☆49 Updated 3 weeks ago
- ACL'24 (Oral) Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback ☆59 Updated 5 months ago
- Official implementation of paper VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interact… ☆27 Updated 2 weeks ago
- [ICLR 2025] Source code for paper "A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegr… ☆67 Updated 2 months ago
- [ECCV2024] Official code implementation of Merlin: Empowering Multimodal LLMs with Foresight Minds