PKU-YuanGroup / LLaVA-CoT
LLaVA-CoT, a visual language model capable of spontaneous, systematic reasoning
☆1,973 · Updated 3 weeks ago
Alternatives and similar repositories for LLaVA-CoT:
Users interested in LLaVA-CoT are comparing it to the repositories listed below:
- Witness the aha moment of VLM with less than $3. ☆3,622 · Updated 2 months ago
- A fork to add multimodal model training to open-r1 ☆1,245 · Updated 2 months ago
- ☆3,763 · Updated 2 months ago
- A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings. ☆900 · Updated last month
- Open-source evaluation toolkit of large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks ☆2,304 · Updated this week
- ☆1,356 · Updated 5 months ago
- Next-Token Prediction is All You Need ☆2,106 · Updated last month
- Solve Visual Understanding with Reinforced VLMs ☆4,860 · Updated 2 weeks ago
- EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework based on veRL ☆2,258 · Updated this week
- Frontier Multimodal Foundation Models for Image and Video Understanding ☆768 · Updated 2 weeks ago
- An Open Large Reasoning Model for Real-World Solutions ☆1,488 · Updated 2 months ago
- An open-source implementation for fine-tuning the Qwen2-VL and Qwen2.5-VL series by Alibaba Cloud ☆685 · Updated this week
- Famous Vision Language Models and Their Architectures ☆803 · Updated 2 months ago
- ✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction ☆2,256 · Updated last month
- MM-EUREKA: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning ☆590 · Updated this week
- Kimi-VL: Mixture-of-Experts Vision-Language Model for Multimodal Reasoning, Long-Context Understanding, and Strong Agent Capabilities ☆809 · Updated 2 weeks ago
- Cambrian-1 is a family of multimodal LLMs with a vision-centric design. ☆1,898 · Updated 6 months ago
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs ☆1,152 · Updated 3 months ago
- Official repository of 'Visual-RFT: Visual Reinforcement Fine-Tuning' ☆1,627 · Updated 2 weeks ago
- Large Reasoning Models ☆804 · Updated 5 months ago
- R1-onevision, a visual language model capable of deep CoT reasoning. ☆513 · Updated 3 weeks ago
- VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud ☆3,201 · Updated last week
- InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions ☆2,820 · Updated last week
- Anole: An Open, Autoregressive, and Native Multimodal Model for Interleaved Image-Text Generation ☆756 · Updated 9 months ago
- This is the first paper to explore how to effectively use RL for MLLMs, introducing Vision-R1, a reasoning MLLM that leverages cold-start… ☆540 · Updated 3 weeks ago
- ☆358 · Updated 2 months ago
- ☆857 · Updated last month
- [CVPR 2025] Magma: A Foundation Model for Multimodal AI Agents ☆1,623 · Updated this week
- Explore the Multimodal "Aha Moment" on 2B Model ☆583 · Updated last month
- VisionLLM Series ☆1,054 · Updated 2 months ago