mit-han-lab / streaming-vlm
StreamingVLM: Real-Time Understanding for Infinite Video Streams
☆872 · Oct 15, 2025 · Updated 3 months ago
Alternatives and similar repositories for streaming-vlm
Users interested in streaming-vlm are comparing it to the libraries listed below.
- LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale (CVPR 2025) · ☆418 · Oct 29, 2025 · Updated 3 months ago
- D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI · ☆68 · Jan 15, 2026 · Updated 3 weeks ago
- [NeurIPS'25 Spotlight] Adaptive Attention Sparsity with Hierarchical Top-p Pruning · ☆87 · Nov 29, 2025 · Updated 2 months ago
- [ICLR'25] Streaming Video Question-Answering with In-context Video KV-Cache Retrieval · ☆99 · Nov 4, 2025 · Updated 3 months ago
- Scaling Zero-Shot Reference-to-Video Generation · ☆63 · Dec 11, 2025 · Updated 2 months ago
- This is the official implementation of ICCV 2025 "Flash-VStream: Efficient Real-Time Understanding for Long Video Streams" · ☆269 · Oct 15, 2025 · Updated 3 months ago
- ☆20 · Jul 28, 2025 · Updated 6 months ago
- [ICML2025, NeurIPS2025 Spotlight] Sparse VideoGen 1 & 2: Accelerating Video Diffusion Transformers with Sparse Attention · ☆627 · Feb 3, 2026 · Updated last week
- Official repository for the paper "MVP4D: Multi-View Portrait Video Diffusion for Animatable 4D Avatars" · ☆41 · Nov 20, 2025 · Updated 2 months ago
- ☆24 · Nov 1, 2024 · Updated last year
- The official code of "Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning" · ☆81 · Oct 15, 2025 · Updated 3 months ago
- Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks · ☆36 · Nov 27, 2025 · Updated 2 months ago
- A unified inference and post-training framework for accelerated video generation · ☆3,059 · Updated this week
- Provides an interface for extensions to use language models directly in the browser · ☆15 · Updated this week
- Long Context Research · ☆26 · Jan 26, 2026 · Updated 2 weeks ago
- [CVPR 2025] OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? · ☆120 · Jul 24, 2025 · Updated 6 months ago
- Code implementation for paper titled "HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision" · ☆29 · Apr 16, 2024 · Updated last year
- ☆13 · Aug 26, 2024 · Updated last year
- Official Repository for paper "HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding" · ☆57 · Jan 23, 2026 · Updated 3 weeks ago
- Code for paper "CLiFT: Compressive Light-Field Tokens for Compute Efficient and Adaptive Neural Rendering" [NeurIPS 2025 (spotlight)] · ☆75 · Aug 2, 2025 · Updated 6 months ago
- [ICCV 2025] Official Repository of VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges · ☆83 · Feb 27, 2025 · Updated 11 months ago
- [ICLR 2026] LongLive: Real-time Interactive Long Video Generation · ☆1,040 · Jan 27, 2026 · Updated 2 weeks ago
- Code for the paper “Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling” · ☆122 · Updated this week
- [ICCV'25] HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics · ☆38 · Sep 10, 2025 · Updated 5 months ago
- [CVPR 2025] Magma: A Foundation Model for Multimodal AI Agents · ☆1,895 · Jan 22, 2026 · Updated 3 weeks ago
- [ICLR 2025] CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion · ☆55 · Jul 1, 2025 · Updated 7 months ago
- [CVPR 2025] EgoLife: Towards Egocentric Life Assistant · ☆390 · Mar 19, 2025 · Updated 10 months ago
- (CVPR 2024) MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding · ☆345 · Jul 19, 2024 · Updated last year
- TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models · ☆37 · Nov 10, 2024 · Updated last year
- ☆82 · Oct 13, 2025 · Updated 4 months ago
- Official PyTorch implementation of the paper Transformer-Based Image Generation from Scene Graphs https://arxiv.org/abs/2303.04634 · ☆19 · Jan 30, 2024 · Updated 2 years ago
- Sirius-Fleet: Multi-Task Interactive Robot Fleet Learning with Visual World Models · ☆17 · Mar 12, 2025 · Updated 11 months ago
- [ICML 2025 Oral] An official implementation of VideoRoPE & VideoRoPE++ · ☆216 · Feb 2, 2026 · Updated last week
- [CVPR 2025] Official PyTorch implementation of "EdgeTAM: On-Device Track Anything Model" · ☆871 · Jan 27, 2026 · Updated 2 weeks ago
- VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and clou… · ☆3,737 · Nov 28, 2025 · Updated 2 months ago
- Official implementation of "Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding" · ☆833 · Jan 28, 2026 · Updated 2 weeks ago
- [NeurIPS 2024] VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models · ☆176 · Sep 26, 2024 · Updated last year
- ☆4,552 · Sep 14, 2025 · Updated 5 months ago
- Official PyTorch implementation for "Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising" · ☆336 · Updated this week