A curated list of awesome Multimodal studies.
⭐ 322 · Mar 11, 2026 · Updated last month
Alternatives and similar repositories for Awesome-Multimodal-Papers
Users who are interested in Awesome-Multimodal-Papers are comparing it to the repositories listed below.
- Awesome papers on token redundancy reduction · ⭐ 11 · Mar 12, 2025 · Updated last year
- Official repository of RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning · ⭐ 14 · Jul 9, 2025 · Updated 9 months ago
- Paper list about multimodal and large language models, only used to record papers I read in the daily arXiv for personal needs · ⭐ 756 · Apr 6, 2026 · Updated last week
- Latest Advances on Multimodal Large Language Models · ⭐ 17,624 · Apr 9, 2026 · Updated last week
- This is a repository for organizing papers, code, and other resources related to unified multimodal models · ⭐ 817 · Oct 10, 2025 · Updated 6 months ago
- The codebase for our EMNLP24 paper: Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Mo… · ⭐ 86 · Jan 27, 2025 · Updated last year
- Latest open-source "Thinking with images" (O3/O4-mini) papers, covering training-free, SFT-based, and RL-enhanced methods for "fine-grain… · ⭐ 113 · Aug 21, 2025 · Updated 7 months ago
- ⭐ 22 · Jul 9, 2025 · Updated 9 months ago
- ✨✨ [ICLR 2026] MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models · ⭐ 43 · Apr 10, 2025 · Updated last year
- 🔥🔥🔥 A curated list of papers on LLM-based multimodal generation (image, video, 3D and audio) · ⭐ 545 · Apr 4, 2025 · Updated last year
- Famous Vision Language Models and Their Architectures · ⭐ 1,229 · Jan 11, 2026 · Updated 3 months ago
- [AAAI 2026 Oral] The official code of "UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning" · ⭐ 71 · Dec 8, 2025 · Updated 4 months ago
- Official repository for LLaVA-Reward (ICCV 2025): Multimodal LLMs as Customized Reward Models for Text-to-Image Generation · ⭐ 23 · Jul 30, 2025 · Updated 8 months ago
- [CVPR 2025] DocLayLLM: An Efficient Multi-modal Extension of Large Language Models for Text-rich Document Understanding · ⭐ 29 · Dec 18, 2025 · Updated 4 months ago
- Official repo for "AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability" · ⭐ 34 · Jul 12, 2024 · Updated last year
- The official implementation of "Grounded Chain-of-Thought for Multimodal Large Language Models" · ⭐ 21 · Jul 21, 2025 · Updated 8 months ago
- This repository is related to 'Intriguing Properties of Hyperbolic Embeddings in Vision-Language Models', published at TMLR (2024), https… · ⭐ 22 · Jul 5, 2024 · Updated last year
- This repository provides a valuable reference for researchers in the field of multimodality; please start your exploratory travel in RL-bas… · ⭐ 1,401 · Feb 26, 2026 · Updated last month
- ⭐ 25 · Nov 17, 2025 · Updated 5 months ago
- ⭐ 37 · Apr 6, 2026 · Updated last week
- Collection of AWESOME vision-language models for vision tasks · ⭐ 3,112 · Oct 14, 2025 · Updated 6 months ago
- This repo contains the code for "VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks" [ICLR 2025] · ⭐ 624 · Apr 12, 2026 · Updated last week
- Agentic Keyframe Search for Video Question Answering · ⭐ 18 · Apr 7, 2025 · Updated last year
- Awesome_Multimodel is a curated GitHub repository that provides a comprehensive collection of resources for Multimodal Large Language Mod… · ⭐ 369 · Mar 19, 2025 · Updated last year
- [MM'2024] Official release of RFUND introduced in the MM'2024 paper "PEneo: Unifying Line Extraction, Line Grouping, and Entity Linking f… · ⭐ 21 · Dec 4, 2024 · Updated last year
- [CVPR'24] Official implementation of our paper "Self-Supervised Facial Representation Learning with Facial Region Awareness" · ⭐ 15 · Mar 8, 2024 · Updated 2 years ago
- [CVPR'24] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback · ⭐ 307 · Sep 11, 2024 · Updated last year
- Code for "CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning" · ⭐ 33 · Mar 26, 2025 · Updated last year
- Efficient Multimodal Large Language Models: A Survey · ⭐ 386 · Apr 29, 2025 · Updated 11 months ago
- ⭐ 14 · May 26, 2023 · Updated 2 years ago
- Reading list for research topics in multimodal machine learning · ⭐ 6,865 · Aug 20, 2024 · Updated last year
- MLLM @ Game · ⭐ 16 · May 12, 2025 · Updated 11 months ago
- [ICLR2025] MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models · ⭐ 97 · Sep 14, 2024 · Updated last year
- Code repository for the paper "The Inherent Limits of Pretrained LLMs: The Unexpected Convergence of Instruction Tuning and In-Context Le… · ⭐ 13 · Jan 16, 2025 · Updated last year
- [ICCV 2025] Enhancing Partially Relevant Video Retrieval with Hyperbolic Learning · ⭐ 53 · Mar 17, 2026 · Updated last month
- A curated list of resources dedicated to hallucination of multimodal large language models (MLLM) · ⭐ 1,003 · Sep 27, 2025 · Updated 6 months ago
- We introduce a new approach, Token Reduction using CLIP Metric (TRIM), aimed at improving the efficiency of MLLMs without sacrificing their… · ⭐ 22 · Jan 11, 2026 · Updated 3 months ago
- VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation · ⭐ 17 · Jun 2, 2025 · Updated 10 months ago
- [NeurIPS2024] Repo for the paper "ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models" · ⭐ 206 · Jul 17, 2025 · Updated 9 months ago