DAMO-NLP-SG / VideoLLaMA3
Frontier Multimodal Foundation Models for Image and Video Understanding
☆946 · Updated last week
Alternatives and similar repositories for VideoLLaMA3
Users interested in VideoLLaMA3 are comparing it to the libraries listed below.
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs ☆1,211 · Updated 7 months ago
- 🔥🔥 First-ever hour-scale video understanding models ☆525 · Updated last month
- Tarsier -- a family of large-scale video-language models designed to generate high-quality video descriptions, together with g… ☆448 · Updated last week
- Official repository for the paper PLLaVA ☆665 · Updated last year
- VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling ☆459 · Updated 2 months ago
- [ICCV 2025] Implementation for Describe Anything: Detailed Localized Image and Video Captioning ☆1,311 · Updated last month
- [ICML 2025] Official PyTorch implementation of LongVU ☆393 · Updated 3 months ago
- ✨✨ [CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis ☆623 · Updated 3 months ago
- Video-R1: Reinforcing Video Reasoning in MLLMs [🔥 the first paper to explore R1 for video] ☆667 · Updated 3 weeks ago
- Official code for the Goldfish model for long video understanding and MiniGPT4-video for short video understanding ☆627 · Updated 8 months ago
- NeurIPS 2024 Paper: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing ☆564 · Updated 10 months ago
- A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings. ☆1,232 · Updated this week
- SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models ☆237 · Updated 11 months ago
- Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving stat… ☆1,394 · Updated 2 months ago
- LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (ECCV 2024) ☆832 · Updated last year
- [CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding ☆389 · Updated 3 months ago
- 💡 VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning ☆246 · Updated last month
- LLM2CLIP makes a SOTA pretrained CLIP model even stronger. ☆536 · Updated last month
- Official implementation of ICCV 2025 "Flash-VStream: Efficient Real-Time Understanding for Long Video Streams" ☆221 · Updated last month
- [CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding ☆642 · Updated 6 months ago
- ☆365 · Updated 6 months ago
- MiMo-VL ☆516 · Updated this week
- Official code of VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding (ECCV 2024) ☆240 · Updated 8 months ago
- Code for the Molmo Vision-Language Model ☆720 · Updated 8 months ago
- Official implementation of BLIP3o-Series ☆1,420 · Updated last week
- Official repository of the paper VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding ☆284 · Updated 2 weeks ago
- Long Context Transfer from Language to Vision ☆390 · Updated 5 months ago
- GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning ☆1,484 · Updated this week
- An open-source implementation for fine-tuning the Qwen2-VL and Qwen2.5-VL series by Alibaba Cloud. ☆1,062 · Updated last week
- [ECCV 2024] Video Foundation Models & Data for Multimodal Understanding ☆2,013 · Updated 2 weeks ago