PKU-YuanGroup / LLaVA-CoTLinks

[ICCV 2025] LLaVA-CoT, a visual language model capable of spontaneous, systematic reasoning

☆2,084

Alternatives and similar repositories for LLaVA-CoT

Users that are interested in LLaVA-CoT are comparing it to the libraries listed below

Sorting:

StarsfieldAI / R1-V
Witness the aha moment of VLM with less than $3.
☆3,956Updated 5 months ago
LLaVA-VL / LLaVA-NeXT
☆4,321Updated last month
open-compass / VLMEvalKit
Open-source evaluation toolkit of large multi-modality models (LMMs), support 220+ LMMs, 80+ benchmarks
☆3,216Updated this week
AIDC-AI / Ovis
A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.
☆1,373Updated last month
baaivision / Emu3
Next-Token Prediction is All You Need
☆2,216Updated 7 months ago
EvolvingLMMs-Lab / open-r1-multimodal
A fork to add multimodal model training to open-r1
☆1,409Updated 8 months ago
rhymes-ai / Aria
Codebase for Aria - an Open Multimodal Native MoE
☆1,071Updated 9 months ago
BAAI-DCAI / Bunny
A family of lightweight multimodal models.
☆1,046Updated 11 months ago
2U1 / Qwen-VL-Series-Finetune
An open-source implementaion for fine-tuning Qwen-VL series by Alibaba Cloud.
☆1,270Updated this week
PKU-YuanGroup / MoE-LLaVA
【TMM 2025🔥】 Mixture-of-Experts for Large Vision-Language Models
☆2,254Updated 3 months ago
DAMO-NLP-SG / VideoLLaMA2
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
☆1,234Updated 8 months ago
DAMO-NLP-SG / VideoLLaMA3
Frontier Multimodal Foundation Models for Image and Video Understanding
☆1,011Updated 2 months ago
cambrian-mllm / cambrian
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
☆1,955Updated 11 months ago
MoonshotAI / Kimi-VL
Kimi-VL: Mixture-of-Experts Vision-Language Model for Multimodal Reasoning, Long-Context Understanding, and Strong Agent Capabilities
☆1,071Updated 3 months ago
AIDC-AI / Marco-o1
An Open Large Reasoning Model for Real-World Solutions
☆1,522Updated 4 months ago
NVlabs / describe-anything
[ICCV 2025] Implementation for Describe Anything: Detailed Localized Image and Video Captioning
☆1,365Updated 3 months ago
facebookresearch / chameleon
Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR.
☆2,060Updated last year
gokayfem / awesome-vlm-architectures
Famous Vision Language Models and Their Architectures
☆1,041Updated 7 months ago
NVlabs / VILA
VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and clou…
☆3,602Updated 2 months ago
allenai / molmo
Code for the Molmo Vision-Language Model
☆775Updated 10 months ago
TinyLLaVA / TinyLLaVA_Factory
A Framework of Small-scale Large Multimodal Models
☆910Updated 5 months ago
ByteDance-Seed / Seed1.5-VL
Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving stat…
☆1,463Updated 4 months ago
InternLM / InternLM-XComposer
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
☆2,899Updated 4 months ago
Open-Source-O1 / Open-O1
☆1,350Updated 11 months ago
OpenBMB / VisRAG
Parsing-free RAG supported by VLMs
☆818Updated last week
GAIR-NLP / anole
Anole: An Open, Autoregressive and Native Multimodal Models for Interleaved Image-Text Generation
☆810Updated 4 months ago
X-PLUG / mPLUG-DocOwl
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
☆2,249Updated 4 months ago
microsoft / LLM2CLIP
LLM2CLIP makes SOTA pretrained CLIP model more SOTA ever.
☆554Updated 3 months ago
microsoft / Magma
[CVPR 2025] Magma: A Foundation Model for Multimodal AI Agents
☆1,827Updated 2 weeks ago
yunlong10 / Awesome-LLMs-for-Video-Understanding
🔥🔥🔥 [IEEE TCSVT] Latest Papers, Codes and Datasets on Vid-LLMs.
☆2,841Updated 2 weeks ago