PKU-YuanGroup / LLaVA-CoT
LLaVA-CoT, a visual language model capable of spontaneous, systematic reasoning
☆1,832 · Updated last month
Alternatives and similar repositories for LLaVA-CoT:
Users interested in LLaVA-CoT are comparing it to the libraries listed below.
- mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding ☆2,104 · Updated last month
- Next-Token Prediction is All You Need ☆2,004 · Updated 3 months ago
- Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks ☆1,856 · Updated this week
- A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings. ☆660 · Updated this week
- An Open Large Reasoning Model for Real-World Solutions ☆1,444 · Updated 2 months ago
- ✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction ☆2,093 · Updated last week
- Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR. ☆1,930 · Updated 6 months ago
- VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud. ☆2,916 · Updated last week
- A Self-adaptation Framework🐙 that adapts LLMs for unseen tasks in real time! ☆948 · Updated 3 weeks ago
- Large Reasoning Models ☆801 · Updated 2 months ago
- InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions ☆2,755 · Updated last month
- Parsing-free RAG supported by VLMs ☆593 · Updated this week
- A family of lightweight multimodal models. ☆987 · Updated 3 months ago
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs ☆1,078 · Updated 3 weeks ago
- A Framework of Small-scale Large Multimodal Models ☆745 · Updated 3 weeks ago
- Cambrian-1 is a family of multimodal LLMs with a vision-centric design. ☆1,854 · Updated 3 months ago
- 🔥🔥🔥Latest Papers, Codes and Datasets on Vid-LLMs. ☆1,951 · Updated 3 weeks ago
- LLM2CLIP makes the SOTA pretrained CLIP model even more SOTA. ☆470 · Updated last month
- The code used to train and run inference with the ColVision models, e.g. ColPali, ColQwen2, and ColSmol. ☆1,492 · Updated this week
- GPT4V-level open-source multi-modal model based on Llama3-8B ☆2,262 · Updated 5 months ago
- Codebase for Aria - an Open Multimodal Native MoE ☆998 · Updated last month
- 4M: Massively Multimodal Masked Modeling ☆1,686 · Updated this week