mit-han-lab / streaming-vlm
StreamingVLM: Real-Time Understanding for Infinite Video Streams
☆731 Updated last month
Alternatives and similar repositories for streaming-vlm
Users who are interested in streaming-vlm are comparing it to the repositories listed below.
- Scaling Vision Pre-Training to 4K Resolution ☆215 Updated 3 months ago
- Native Multimodal Models are World Learners ☆1,292 Updated this week
- NEO Series: Native Vision-Language Models from First Principles ☆225 Updated last month
- [ICCV 2025] OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning ☆408 Updated this week
- ☆573 Updated 3 weeks ago
- [CVPR 2025] Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction ☆149 Updated 8 months ago
- Official implementation of "Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs". ☆94 Updated 3 weeks ago
- OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language. ☆591 Updated last month
- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction ☆250 Updated last month
- Visual Planning: Let's Think Only with Images ☆283 Updated 6 months ago
- Official PyTorch implementation of TokenSet. ☆127 Updated 8 months ago
- Official Implementation for our NeurIPS 2024 paper, "Don't Look Twice: Run-Length Tokenization for Faster Video Transformers". ☆229 Updated 8 months ago
- [ICCV 2025] Video-T1: Test-Time Scaling for Video Generation ☆301 Updated 5 months ago
- Cosmos-Curate is a powerful video curation system that processes, analyzes, and organizes video content using advanced AI models and dist… ☆108 Updated this week
- ☆329 Updated 3 months ago
- video-SALMONN 2 is a powerful audio-visual large language model (LLM) that generates high-quality audio-visual video captions, which is d… ☆123 Updated last month
- The official repo for "Vidi: Large Multimodal Models for Video Understanding and Editing" ☆393 Updated this week
- [ACL2025 Oral & Award] Evaluate Image/Video Generation like Humans - Fast, Explainable, Flexible ☆109 Updated 3 months ago
- [ICCV 2025] GameFactory: Creating New Games with Generative Interactive Videos ☆448 Updated 8 months ago
- Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models ☆161 Updated last month
- [ICML 2025] Official PyTorch implementation of LongVU ☆412 Updated 6 months ago
- ☆78 Updated 7 months ago
- Cambrian-S: Towards Spatial Supersensing in Video ☆396 Updated 3 weeks ago
- Krea Realtime 14B. An open-source realtime AI video model. ☆405 Updated 3 weeks ago
- Cosmos-Predict2.5, the latest version of the Cosmos World Foundation Models (WFMs) family, specialized for simulating and predicting the … ☆465 Updated last week
- [CVPR 2025] EgoLife: Towards Egocentric Life Assistant ☆351 Updated 8 months ago
- Orient Anything, ICML 2025 ☆349 Updated last month
- [ICCV 2025] ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations ☆122 Updated 2 weeks ago
- Cosmos-Predict2 is a collection of general-purpose world foundation models for Physical AI that can be fine-tuned into customized world m… ☆677 Updated last month
- Official implementation of "Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence" ☆119 Updated 3 weeks ago