ictnlp / LLaVA-Mini
LLaVA-Mini is a unified large multimodal model (LMM) that efficiently supports the understanding of images, high-resolution images, and videos.
☆496 · Updated 5 months ago
Alternatives and similar repositories for LLaVA-Mini
Users interested in LLaVA-Mini are comparing it to the repositories listed below.
- [CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding ☆940 · Updated 8 months ago
- Video-R1: Reinforcing Video Reasoning in MLLMs [🔥the first paper to explore R1 for video] ☆577 · Updated 3 weeks ago
- ☆403 · Updated 10 months ago
- Awesome Unified Multimodal Models ☆335 · Updated last month
- Long Context Transfer from Language to Vision ☆382 · Updated 3 months ago
- Explore the Multimodal “Aha Moment” on 2B Model ☆594 · Updated 3 months ago
- A curated list of research based on CLIP. ☆230 · Updated 7 months ago
- MM-EUREKA: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning ☆665 · Updated last month
- [ICML 2025] Official PyTorch implementation of LongVU ☆383 · Updated last month
- ☆363 · Updated 4 months ago
- [ACL 2025 🔥] Rethinking Step-by-step Visual Reasoning in LLMs ☆302 · Updated last month
- SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models ☆227 · Updated 9 months ago
- [CVPR 2025 Highlight] Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models ☆202 · Updated 2 months ago
- Project Page for "Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement" ☆424 · Updated 2 weeks ago
- Official implementation of UnifiedReward & UnifiedReward-Think ☆429 · Updated last week
- Official repository of the paper OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference ☆145 · Updated 3 months ago
- [Survey] Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey ☆445 · Updated 5 months ago
- R1-onevision, a visual language model capable of deep CoT reasoning. ☆528 · Updated 2 months ago
- LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture ☆204 · Updated 5 months ago
- VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model ☆334 · Updated 2 months ago
- Official repository for VisionZip (CVPR 2025) ☆305 · Updated last month
- A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings. ☆948 · Updated last week
- Frontier Multimodal Foundation Models for Image and Video Understanding ☆858 · Updated last month
- [ECCV 2024 Oral] Code for paper: An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Langua… ☆439 · Updated 5 months ago
- This is the first paper to explore how to effectively use RL for MLLMs and introduce Vision-R1, a reasoning MLLM that leverages cold-sta… ☆613 · Updated 2 weeks ago
- ✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis ☆569 · Updated last month
- StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding ☆126 · Updated last month
- [ICLR 2025] MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts ☆227 · Updated 8 months ago
- MM-IFEngine: Towards Multimodal Instruction Following ☆92 · Updated 2 months ago
- ✨✨Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models ☆159 · Updated 6 months ago