ictnlp / LLaVA-Mini
LLaVA-Mini is a unified large multimodal model (LMM) that supports efficient understanding of images, high-resolution images, and videos.
☆399 · Updated 2 months ago
Alternatives and similar repositories for LLaVA-Mini:
Users interested in LLaVA-Mini are comparing it to the repositories listed below.
- ☆393 · Updated 7 months ago
- [CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding ☆917 · Updated 4 months ago
- Rethinking Step-by-step Visual Reasoning in LLMs ☆268 · Updated last month
- Long Context Transfer from Language to Vision ☆367 · Updated 3 months ago
- ☆360 · Updated last week
- SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models ☆206 · Updated 5 months ago
- The official repository for "2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining" ☆144 · Updated last month
- ✨✨ Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models ☆154 · Updated 2 months ago
- [ICLR 2025] MLLM for On-Demand Spatial-Temporal Understanding at Arbitrary Resolution ☆295 · Updated 2 weeks ago
- 📖 A repository for organizing papers, code, and other resources related to unified multimodal models. ☆394 · Updated last month
- A curated list of research based on CLIP. ☆180 · Updated 3 months ago
- Official repository for the paper MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning (https://arxiv.org/abs/2406.17770). ☆153 · Updated 5 months ago
- Tarsier -- a family of large-scale video-language models designed to generate high-quality video descriptions, together with g… ☆312 · Updated 3 weeks ago
- ☆375 · Updated 3 months ago
- [ICLR 2025] MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts ☆194 · Updated 4 months ago
- LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer ☆369 · Updated 2 months ago
- ☆105 · Updated 7 months ago
- [ECCV 2024 Oral] Code for the paper: An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Langua… ☆377 · Updated 2 months ago
- StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding ☆112 · Updated this week
- When do we not need larger vision models? ☆373 · Updated last month
- ☆333 · Updated last month
- MoH: Multi-Head Attention as Mixture-of-Head Attention ☆215 · Updated 4 months ago