VectorSpaceLab / Video-XL
🔥🔥 First-ever hour-scale video understanding models
⭐152 · Updated 2 weeks ago
Related projects
Alternatives and complementary repositories for Video-XL
- Long Context Transfer from Language to Vision · ⭐328 · Updated 2 weeks ago
- Official implementation of "Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams" · ⭐127 · Updated 3 months ago
- Explore the Limits of Omni-modal Pretraining at Scale · ⭐86 · Updated 2 months ago
- [ECCV 2024] ShareGPT4V: Improving Large Multi-modal Models with Better Captions · ⭐151 · Updated 4 months ago
- Implementation of PALI3 from the paper "PaLI-3 Vision Language Models: Smaller, Faster, Stronger" · ⭐144 · Updated last week
- Image Textualization: An Automatic Framework for Generating Rich and Detailed Image Descriptions (NeurIPS 2024) · ⭐142 · Updated 3 months ago
- Official repository for the paper "MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning" (https://arxiv.org/abs/2406.17770) · ⭐147 · Updated last month
- [NeurIPS'24 Spotlight] EVE: Encoder-Free Vision-Language Models · ⭐227 · Updated last month
- Multimodal Models in Real World · ⭐400 · Updated 2 weeks ago
- A repository organizing papers, code, and other resources related to unified multimodal models · ⭐205 · Updated this week
- Harnessing 1.4M GPT4V-synthesized Data for A Lite Vision-Language Model · ⭐244 · Updated 4 months ago
- [CVPR 2024] LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge · ⭐121 · Updated 3 months ago
- LLaVA-HR: High-Resolution Large Language-Vision Assistant · ⭐213 · Updated 2 months ago
- LVBench: An Extreme Long Video Understanding Benchmark · ⭐59 · Updated 2 months ago
- Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models · ⭐51 · Updated last week
- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer · ⭐197 · Updated 7 months ago
- Official code for GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation · ⭐132 · Updated 2 weeks ago
- ✨✨ Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models · ⭐137 · Updated this week
- [ACL 2024] GroundingGPT: Language-Enhanced Multi-modal Grounding Model · ⭐302 · Updated last week
- OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text · ⭐270 · Updated 2 weeks ago
- LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images · ⭐318 · Updated last month
- LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture · ⭐178 · Updated last month
- SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models · ⭐166 · Updated last month