mbzuai-oryx / Video-LLaVA
PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models
☆ 261 · Updated Aug 5, 2025 (6 months ago)
Alternatives and similar repositories for Video-LLaVA
Users interested in Video-LLaVA are comparing it to the repositories listed below.
- [ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the cap… · ☆ 1,488 · Updated Aug 5, 2025 (6 months ago)
- [CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses tha… · ☆ 943 · Updated Aug 5, 2025 (6 months ago)
- [CVPR 2025 🔥] A Large Multimodal Model for Pixel-Level Visual Grounding in Videos · ☆ 96 · Updated Apr 14, 2025 (10 months ago)
- [CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding · ☆ 409 · Updated May 8, 2025 (9 months ago)
- [CVPR'2024 Highlight] Official PyTorch implementation of the paper "VTimeLLM: Empower LLM to Grasp Video Moments". · ☆ 294 · Updated Jun 13, 2024 (last year)
- 【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection · ☆ 3,447 · Updated Dec 3, 2024 (last year)
- Official repository of the paper "VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding" · ☆ 292 · Updated Aug 5, 2025 (6 months ago)
- [NeurIPS 2024] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos · ☆ 145 · Updated Dec 26, 2024 (last year)
- [CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding · ☆ 682 · Updated Jan 29, 2025 (last year)
- [NAACL'25] Code and documentation for the VANE-Bench paper. · ☆ 17 · Updated Aug 19, 2025 (5 months ago)
- [CVPRW-25 MMFM] Official repository of the paper titled "How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite fo… · ☆ 50 · Updated Aug 23, 2024 (last year)
- [CVPR 2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And many more supported LMs such as miniGPT4, StableLM, and MOSS. · ☆ 3,336 · Updated Jan 18, 2025 (last year)
- ☆ 138 · Updated Sep 29, 2024 (last year)
- [NeurIPS 2023] Align Your Prompts: Test-Time Prompting with Distribution Alignment for Zero-Shot Generalization · ☆ 110 · Updated Feb 11, 2024 (2 years ago)
- 🔥🔥🔥 [IEEE TCSVT] Latest Papers, Codes and Datasets on Vid-LLMs. · ☆ 3,066 · Updated Dec 20, 2025 (last month)
- [CVPR 2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts · ☆ 336 · Updated Jul 17, 2024 (last year)
- ☆ 424 · Updated Jul 29, 2024 (last year)
- Official implementation of HawkEye: Training Video-Text LLMs for Grounding Text in Videos · ☆ 46 · Updated Apr 29, 2024 (last year)
- ☆ 42 · Updated Nov 9, 2023 (2 years ago)
- [NeurIPS 2024] Official implementation of the paper "Interfacing Foundation Models' Embeddings" · ☆ 129 · Updated Aug 21, 2024 (last year)
- [CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding · ☆ 944 · Updated Oct 16, 2024 (last year)
- [EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding · ☆ 3,124 · Updated Jun 4, 2024 (last year)
- [ECCV 2024🔥] Official implementation of the paper "ST-LLM: Large Language Models Are Effective Temporal Learners" · ☆ 150 · Updated Sep 10, 2024 (last year)
- [ICLR 2026] VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling · ☆ 503 · Updated Nov 18, 2025 (2 months ago)
- Official repo of the Griffon series, including v1 (ECCV 2024), v2 (ICCV 2025), G, and R, plus the RL tool Vision-R1. · ☆ 249 · Updated Aug 12, 2025 (6 months ago)
- [ECCV 2024] Video Foundation Models & Data for Multimodal Understanding · ☆ 2,196 · Updated Dec 15, 2025 (last month)
- 【ICLR 2024🔥】Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment · ☆ 866 · Updated Mar 25, 2024 (last year)
- ☆ 110 · Updated Dec 23, 2022 (3 years ago)
- Official PyTorch code of GroundVQA (CVPR'24) · ☆ 64 · Updated Sep 13, 2024 (last year)
- LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (ECCV 2024) · ☆ 859 · Updated Jul 29, 2024 (last year)
- A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models! · ☆ 137 · Updated Dec 31, 2023 (2 years ago)
- Official implementation of the 'CLIP-DINOiser: Teaching CLIP a few DINO tricks' paper. · ☆ 274 · Updated Oct 26, 2024 (last year)
- InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions · ☆ 2,921 · Updated May 26, 2025 (8 months ago)
- [CVPRW 2025] Official repository of the paper titled "Towards Evaluating the Robustness of Visual State Space Models" · ☆ 25 · Updated Jun 8, 2025 (8 months ago)
- [CVPR 2022 Oral] TubeDETR: Spatio-Temporal Video Grounding with Transformers · ☆ 192 · Updated Sep 24, 2023 (2 years ago)
- A new multi-task learning framework using Vision Transformers · ☆ 11 · Updated Jun 19, 2024 (last year)
- [MICCAI 2025] Hierarchical Self-Supervised Adversarial Training for Robust Vision Models in Histopathology · ☆ 12 · Updated Jun 17, 2025 (7 months ago)
- [CVPR 2023] Official repository of the paper titled "Fine-tuned CLIP models are efficient video learners". · ☆ 305 · Updated Apr 3, 2024 (last year)
- ☆ 155 · Updated Oct 31, 2024 (last year)