zzxslp / SoM-LLaVA
[COLM-2024] List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
☆123 · Updated 2 months ago
Related projects
Alternatives and complementary repositories for SoM-LLaVA
- ☆131 · Updated 10 months ago
- [ECCV 2024] Official code implementation of Merlin: Empowering Multimodal LLMs with Foresight Minds ☆82 · Updated 4 months ago
- ☆121 · Updated last week
- [CVPR 2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts ☆294 · Updated 3 months ago
- [NeurIPS'24 Spotlight] EVE: Encoder-Free Vision-Language Models ☆227 · Updated last month
- ☆64 · Updated 4 months ago
- SVIT: Scaling up Visual Instruction Tuning ☆163 · Updated 4 months ago
- Official implementation of the Law of Vision Representation in MLLMs ☆128 · Updated 2 months ago
- Public code repo for the paper "A Single Transformer for Scalable Vision-Language Modeling" ☆113 · Updated last month
- This repo contains evaluation code for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive". https://arxiv.or… ☆107 · Updated 4 months ago
- ☆287 · Updated 9 months ago
- DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception ☆115 · Updated last month
- MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities (ICML 2024) ☆264 · Updated this week
- PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models ☆242 · Updated 10 months ago
- [NeurIPS 2024] MoVA: Adapting Mixture of Vision Experts to Multimodal Context ☆130 · Updated last month
- [NeurIPS 2024] This repo contains evaluation code for the paper "Are We on the Right Way for Evaluating Large Vision-Language Models" ☆145 · Updated last month
- ☆119 · Updated last month
- [NeurIPS'24 Spotlight] Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought … ☆132 · Updated 3 weeks ago
- VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation ☆84 · Updated last month
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want ☆59 · Updated 2 weeks ago
- [NeurIPS 2024] Official implementation of the paper "Interfacing Foundation Models' Embeddings" ☆110 · Updated 2 months ago
- [NeurIPS 2024] Dense Connector for MLLMs ☆133 · Updated 3 weeks ago
- [ACL'24 Oral] Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback ☆52 · Updated last month
- [ICLR'24] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning ☆254 · Updated 7 months ago
- [ACL 2024 Findings] "TempCompass: Do Video LLMs Really Understand Videos?", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, … ☆83 · Updated 2 weeks ago
- A collection of visual instruction tuning datasets. ☆76 · Updated 7 months ago
- Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision ☆46 · Updated 3 months ago
- Official repo for StableLLAVA ☆90 · Updated 10 months ago
- Densely Captioned Images (DCI) dataset repository. ☆158 · Updated 4 months ago
- VLM Evaluation: Benchmark for VLMs, spanning text generation tasks from VQA to Captioning ☆85 · Updated last month