sanbuphy / llm-vision-datasets
Collection of image and video datasets for generative AI and multimodal visual AI
☆31Updated last year
Alternatives and similar repositories for llm-vision-datasets
Users who are interested in llm-vision-datasets are comparing it to the repositories listed below
- Multimodal MM + Chat collection☆276Updated 2 months ago
 - Efficient Multimodal Large Language Models: A Survey☆373Updated 6 months ago
 - Official repo of Griffon series including v1(ECCV 2024), v2(ICCV 2025), G, and R, and also the RL tool Vision-R1.☆240Updated 2 months ago
 - [COLM 2025] Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources☆277Updated 2 months ago
 - Research Code for Multimodal-Cognition Team in Ant Group☆169Updated 2 weeks ago
 - DeepSpeed tutorial & annotated examples & study notes (efficient training of large models)☆179Updated 2 years ago
 - [ICLR 2025] LLaVA-MoD: Making LLaVA Tiny via MoE-Knowledge Distillation☆205Updated 7 months ago
 - [ICLR 2025 Spotlight] OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text☆395Updated 5 months ago
 - MM-Eureka V0, also called R1-Multimodal-Journey; the latest version is in MM-Eureka☆320Updated 4 months ago
 - NeurIPS 2024 Paper: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing☆575Updated last year
 - [ECCV 2024 Oral] Code for paper: An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Langua…☆504Updated 9 months ago
 - [CVPR'25 highlight] RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness☆421Updated 5 months ago
 - Personal Project: MPP-Qwen14B & MPP-Qwen-Next(Multimodal Pipeline Parallel based on Qwen-LM). Support [video/image/multi-image] {sft/conv…☆477Updated 7 months ago
 - TaiSu (太素): a large-scale Chinese multimodal dataset (a hundred-million-scale Chinese vision-language pre-training dataset)☆191Updated last year
 - My implementation of "Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution"☆262Updated last week
 - [CVPR'24] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback☆294Updated last year
 - The code of the paper "NExT-Chat: An LMM for Chat, Detection and Segmentation".☆253Updated last year
 - Reading notes about Multimodal Large Language Models, Large Language Models, and Diffusion Models☆698Updated last month
 - This is the first paper to explore how to effectively use R1-like RL for MLLMs and introduce Vision-R1, a reasoning MLLM that leverages …☆720Updated last month
 - ☆376Updated 8 months ago
 - LLaVA-UHD v2: an MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer☆389Updated this week
 - Official Repo of "MMBench: Is Your Multi-modal Model an All-around Player?"☆263Updated 5 months ago
 - VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks☆389Updated last year
 - Notes mainly covering multimodal knowledge for large language model (LLM) algorithm (application) engineers☆249Updated last year
 - Splices the vision head of SmolVLM2 onto the Qwen3-0.6B model and fine-tunes the combined model☆410Updated last month
 - LinVT: Empower Your Image-level Large Language Model to Understand Videos☆82Updated 10 months ago
 - [NeurIPS 2024] Classification Done Right for Vision-Language Pre-Training☆218Updated 7 months ago
 - Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs☆94Updated 9 months ago
 - ☆58Updated 4 months ago
 - Pruning the VLLMs☆104Updated 10 months ago