mit-han-lab / streaming-vlm
StreamingVLM: Real-Time Understanding for Infinite Video Streams
☆667 Updated 3 weeks ago
Alternatives and similar repositories for streaming-vlm
Users who are interested in streaming-vlm are comparing it to the repositories listed below.
- OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language. ☆400 Updated last week
- NEO Series: Native Vision-Language Models from First Principles ☆222 Updated 2 weeks ago
- The official repo for "Vidi: Large Multimodal Models for Video Understanding and Editing" ☆144 Updated 2 months ago
- Native Multimodal Models are World Learners ☆1,178 Updated this week
- [ICCV 2025] Video-T1: Test-Time Scaling for Video Generation ☆296 Updated 4 months ago
- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction ☆234 Updated 3 weeks ago
- Scaling Vision Pre-Training to 4K Resolution ☆211 Updated 2 months ago
- ☆565 Updated 3 weeks ago
- ☆78 Updated 6 months ago
- Krea Realtime 14B. An open-source realtime AI video model. ☆359 Updated last week
- Official PyTorch implementation of TokenSet. ☆126 Updated 7 months ago
- Cosmos-Curate is a powerful video curation system that processes, analyzes, and organizes video content using advanced AI models and dist… ☆101 Updated last week
- video-SALMONN 2 is a powerful audio-visual large language model (LLM) that generates high-quality audio-visual video captions, which is d… ☆103 Updated 2 weeks ago
- ☆323 Updated 2 months ago
- [ICCV 2025] OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning ☆402 Updated last month
- Generate large-scale explorable 3D scenes with high-quality panorama videos from a single image or text prompt. ☆552 Updated last month
- 🔥 Official impl. of "DetailFlow: 1D Coarse-to-Fine Autoregressive Image Generation via Next-Detail Prediction" ☆159 Updated 4 months ago
- [arXiv] On-device Sora: Enabling Diffusion-Based Text-to-Video Generation for Mobile Devices ☆124 Updated 3 months ago
- QeRL enables RL for 32B LLMs on a single H100 GPU. ☆416 Updated 3 weeks ago
- Visual Planning: Let's Think Only with Images ☆279 Updated 5 months ago
- [CVPR 2025] Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction ☆142 Updated 7 months ago
- LongLive: Real-time Interactive Long Video Generation ☆789 Updated last week
- One-shot and Few-shot 3D Editing without Per-Scene Optimization ☆159 Updated 2 months ago
- Official implementation of "Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs". ☆88 Updated this week
- [ACL 2025 Oral & Award] Evaluate Image/Video Generation like Humans - Fast, Explainable, Flexible ☆107 Updated 3 months ago
- ☆35 Updated 9 months ago
- [ICCV 2025] LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion ☆287 Updated 3 months ago
- Long Context Transfer from Language to Vision ☆396 Updated 7 months ago
- Official Implementation for our NeurIPS 2024 paper, "Don't Look Twice: Run-Length Tokenization for Faster Video Transformers". ☆228 Updated 7 months ago
- Fully Open Framework for Democratized Multimodal Training ☆601 Updated last week