ShareGPT4Omni / ShareGPT4V
[ECCV 2024] ShareGPT4V: Improving Large Multi-modal Models with Better Captions
☆112 Updated 2 months ago
Related projects:
- ☆128 Updated 8 months ago
- ☆113 Updated 2 months ago
- Official repository of MMDU dataset ☆61 Updated last month
- [ECCV2024] Official code implementation of Merlin: Empowering Multimodal LLMs with Foresight Minds ☆80 Updated 2 months ago
- Dense Connector for MLLMs ☆98 Updated last month
- Official repo for StableLLAVA ☆90 Updated 8 months ago
- ☆101 Updated 5 months ago
- EVE: Encoder-Free Vision-Language Models ☆207 Updated last month
- ✨✨Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models ☆128 Updated last month
- VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models ☆93 Updated last month
- [CVPR 2024] LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge ☆114 Updated 2 months ago
- ☆82 Updated 2 months ago
- This is the official implementation of "Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams" ☆105 Updated last month
- This repo contains evaluation code for the paper "Are We on the Right Way for Evaluating Large Vision-Language Models" ☆138 Updated 5 months ago
- Explore the Limits of Omni-modal Pretraining at Scale ☆80 Updated 2 weeks ago
- DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception ☆103 Updated 3 weeks ago
- Please refer to our official repo at https://github.com/IVGSZ/Flash-VStream. ☆48 Updated last month
- Video dataset dedicated to portrait-mode video recognition. ☆35 Updated 5 months ago
- [ECCV 2024] Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs ☆45 Updated last month
- ☆99 Updated this week
- The official implementation of RAR ☆61 Updated 5 months ago
- [ACL 2024 Findings] "TempCompass: Do Video LLMs Really Understand Videos?", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, … ☆75 Updated 2 weeks ago
- [NeurIPS 2023] Customize spatial layouts for conditional image synthesis models, e.g., ControlNet, using GPT ☆129 Updated 4 months ago
- ☆100 Updated last month
- Official implementation of Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input ☆44 Updated 3 weeks ago
- A collection of visual instruction tuning datasets. ☆74 Updated 6 months ago
- A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models! ☆114 Updated 8 months ago
- Implements VAR+CLIP for image generation ☆64 Updated last month
- [CVPR 2024] CapsFusion: Rethinking Image-Text Data at Scale ☆189 Updated 6 months ago
- Official implementation of MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis ☆75 Updated 2 months ago