mit-han-lab / streaming-vlm
StreamingVLM: Real-Time Understanding for Infinite Video Streams
☆828 Updated 3 months ago
Alternatives and similar repositories for streaming-vlm
Users interested in streaming-vlm are comparing it to the repositories listed below.
- [CVPR 2025] Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction ☆155 Updated 9 months ago
- Cambrian-S: Towards Spatial Supersensing in Video ☆475 Updated 3 weeks ago
- OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language. ☆625 Updated 2 months ago
- Scaling Vision Pre-Training to 4K Resolution ☆217 Updated 2 weeks ago
- The official repository of InfiniteVL ☆71 Updated last month
- Native Multimodal Models are World Learners ☆1,399 Updated 2 weeks ago
- [CVPR 2025] EgoLife: Towards Egocentric Life Assistant ☆376 Updated 10 months ago
- DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models ☆161 Updated 2 weeks ago
- Visual Planning: Let's Think Only with Images ☆294 Updated 8 months ago
- Official implementation of "Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs". ☆95 Updated 2 months ago
- Cosmos-Predict2.5, the latest version of the Cosmos World Foundation Models (WFMs) family, specialized for simulating and predicting the … ☆651 Updated 2 weeks ago
- [NeurIPS 2025] VideoChat-R1 & R1.5: Enhancing Spatio-Temporal Perception and Reasoning via Reinforcement Fine-Tuning ☆253 Updated 3 months ago
- video-SALMONN 2 is a powerful audio-visual large language model (LLM) that generates high-quality audio-visual video captions, which is d… ☆136 Updated 3 weeks ago
- An official implementation of "CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning" ☆175 Updated 3 weeks ago
- NextStep-1: SOTA Autoregressive Image Generation with Continuous Tokens. A research project developed by StepFun’s Multimodal Intellige… ☆597 Updated 3 weeks ago
- Official implementation of "Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence" ☆128 Updated last month
- A Large-scale Video Action Dataset ☆162 Updated this week
- [ICCV 2025] Video-T1: Test-Time Scaling for Video Generation ☆304 Updated 6 months ago
- [NeurIPS 2025] Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance ☆519 Updated 2 weeks ago
- Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models ☆166 Updated 3 months ago
- [ICCV 2025] OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning ☆413 Updated last month
- Long Context Transfer from Language to Vision ☆398 Updated 10 months ago
- Official implementation of Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence ☆422 Updated this week
- Official Implementation for our NeurIPS 2024 paper, "Don't Look Twice: Run-Length Tokenization for Faster Video Transformers". ☆232 Updated 9 months ago
- Tarsier -- a family of large-scale video-language models, which is designed to generate high-quality video descriptions, together with g… ☆512 Updated 5 months ago
- [ACL 2025 Oral & Award] Evaluate Image/Video Generation like Humans - Fast, Explainable, Flexible ☆114 Updated 5 months ago
- [ICML 2025] Official PyTorch implementation of LongVU ☆417 Updated 8 months ago
- ☆304 Updated this week
- Long-RL: Scaling RL to Long Sequences (NeurIPS 2025) ☆684 Updated 3 months ago
- 💡 VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning ☆298 Updated 3 months ago