mit-han-lab / streaming-vlm
StreamingVLM: Real-Time Understanding for Infinite Video Streams
☆771 Updated 2 months ago
Alternatives and similar repositories for streaming-vlm
Users interested in streaming-vlm are comparing it to the libraries listed below.
- OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language. ☆607 Updated last month
- [CVPR 2025] Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction ☆153 Updated 9 months ago
- ☆579 Updated last month
- This is the official repository of InfiniteVL ☆54 Updated last week
- Scaling Vision Pre-Training to 4K Resolution ☆217 Updated 3 months ago
- Native Multimodal Models are World Learners ☆1,367 Updated 3 weeks ago
- An official implementation of "CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning" ☆157 Updated last month
- ☆156 Updated last week
- Cambrian-S: Towards Spatial Supersensing in Video ☆429 Updated last week
- Visual Planning: Let's Think Only with Images ☆287 Updated 7 months ago
- video-SALMONN 2 is a powerful audio-visual large language model (LLM) that generates high-quality audio-visual video captions, which is d… ☆132 Updated this week
- [CVPR 2025] EgoLife: Towards Egocentric Life Assistant ☆368 Updated 9 months ago
- Long-RL: Scaling RL to Long Sequences (NeurIPS 2025) ☆680 Updated 3 months ago
- [ICCV 2025] OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning ☆410 Updated 3 weeks ago
- Long Context Transfer from Language to Vision ☆399 Updated 9 months ago
- Official PyTorch implementation of TokenSet. ☆127 Updated 9 months ago
- [ICCV 2025] Video-T1: Test-Time Scaling for Video Generation ☆303 Updated 5 months ago
- Fully Open Framework for Democratized Multimodal Training ☆662 Updated last week
- [ICLR 2025] VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation ☆411 Updated 8 months ago
- [ICML 2025] Official PyTorch implementation of LongVU ☆412 Updated 7 months ago
- Official Implementation for our NeurIPS 2024 paper, "Don't Look Twice: Run-Length Tokenization for Faster Video Transformers". ☆231 Updated 8 months ago
- The official repository of "Astra: General Interactive World Model with Autoregressive Denoising" ☆172 Updated this week
- Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models ☆164 Updated 2 months ago
- [ArXiv 2025] DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models ☆68 Updated last week
- LongLive: Real-time Interactive Long Video Generation ☆925 Updated 3 weeks ago
- [ACL2025 Oral & Award] Evaluate Image/Video Generation like Humans - Fast, Explainable, Flexible ☆113 Updated 4 months ago
- ☆345 Updated 4 months ago
- Cosmos-Predict2.5, the latest version of the Cosmos World Foundation Models (WFMs) family, specialized for simulating and predicting the … ☆540 Updated this week
- [NIPS2025] VideoChat-R1 & R1.5: Enhancing Spatio-Temporal Perception and Reasoning via Reinforcement Fine-Tuning ☆249 Updated 2 months ago
- PyTorch implementation of NEPA ☆70 Updated this week