zzxslp / SoM-LLaVALinks

[COLM-2024] List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs

☆144

Alternatives and similar repositories for SoM-LLaVA

Users that are interested in SoM-LLaVA are comparing it to the libraries listed below

Sorting:

X2FD / LVIS-INSTRUCT4V
☆133Updated last year
isekai-portal / Link-Context-Learning
☆99Updated last year
RifleZhang / LLaVA-Hound-DPO
☆155Updated last year
imagegridworth / IG-VLM
☆138Updated last year
WisconsinAIVision / ViP-LLaVA
[CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
☆331Updated last year
Ahnsun / merlin
[ECCV2024] Official code implementation of Merlin: Empowering Multimodal LLMs with Foresight Minds
☆94Updated last year
yuhui-zh15 / VLMClassifier
Official implementation of "Why are Visually-Grounded Language Models Bad at Image Classification?" (NeurIPS 2024)
☆91Updated last year
llyx97 / TempCompass
[ACL 2024 Findings] "TempCompass: Do Video LLMs Really Understand Videos?", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, …
☆125Updated 6 months ago
zeyofu / BLINK_Benchmark
This repo contains evaluation code for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive". https://arxiv.or…
☆147Updated last month
bronyayang / Law_of_Vision_Representation_in_MLLMs
[COLM'25] Official implementation of the Law of Vision Representation in MLLMs
☆168Updated 3 weeks ago
mbzuai-oryx / Video-LLaVA
PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models
☆259Updated 2 months ago
amazon-science / QA-ViT
☆69Updated last year
yonseivnl / vlm-rlaif
ACL'24 (Oral) Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback
☆75Updated last year
TIGER-AI-Lab / Mantis
Official code for Paper "Mantis: Multi-Image Instruction Tuning" [TMLR 2024]
☆231Updated 7 months ago
MMStar-Benchmark / MMStar
[NeurIPS 2024] This repo contains evaluation code for the paper "Are We on the Right Way for Evaluating Large Vision-Language Models"
☆196Updated last year
opendatalab / HA-DPO
Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization
☆95Updated last year
luogen1996 / LLaVA-HR
[ICLR2025] LLaVA-HR: High-Resolution Large Language-Vision Assistant
☆244Updated last year
dvlab-research / Prompt-Highlighter
[CVPR 2024] Prompt Highlighter: Interactive Control for Multi-Modal LLMs
☆155Updated last year
PKU-YuanGroup / Video-Bench
A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models!
☆135Updated last year
AFeng-x / Draw-and-Understand
[ICLR2025] Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
☆88Updated 4 months ago
TempleX98 / MoVA
[NeurIPS 2024] MoVA: Adapting Mixture of Vision Experts to Multimodal Context
☆167Updated last year
icoz69 / StableLLAVA
Official repo for StableLLAVA
☆94Updated last year
AoiDragon / POPE
[EMNLP'23] The official GitHub page for ''Evaluating Object Hallucination in Large Vision-Language Models''
☆94Updated 2 months ago
BAAI-DCAI / DataOptim
A collection of visual instruction tuning datasets.
☆76Updated last year
SHI-Labs / CuMo
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
☆157Updated last year
cliangyu / Cola
[NeurIPS2023] Official implementation of the paper "Large Language Models are Visual Reasoning Coordinators"
☆103Updated last year
zai-org / CogCoM
☆214Updated last year
CeeZh / LLoVi
Official implementation for "A Simple LLM Framework for Long-Range Video Question-Answering"
☆101Updated last year
ShareGPT4Omni / ShareGPT4V
[ECCV 2024] ShareGPT4V: Improving Large Multi-modal Models with Better Captions
☆239Updated last year
foundation-multimodal-models / CAPTURE
☆76Updated last year