friedrichor / Awesome-Multimodal-Papers
A curated list of awesome Multimodal studies.
★312, updated Dec 14, 2025
Alternatives and similar repositories for Awesome-Multimodal-Papers
Users interested in Awesome-Multimodal-Papers are comparing it to the libraries listed below.
- Official Repository of RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning (★14, updated Jul 9, 2025)
- Awesome papers on token redundancy reduction (★11, updated Mar 12, 2025)
- Paper list about multimodal and large language models, only used to record papers I read in the daily arXiv for personal needs. (★755, updated Jan 22, 2026)
- The codebase for our EMNLP24 paper: Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Mo… (★86, updated Jan 27, 2025)
- Latest Advances on Multimodal Large Language Models (★17,337, updated Feb 7, 2026)
- Official repository for LLaVA-Reward (ICCV 2025): Multimodal LLMs as Customized Reward Models for Text-to-Image Generation (★23, updated Jul 30, 2025)
- [ICLR 2026] MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models (★43, updated Apr 10, 2025)
- An open-source implementation for fine-tuning DINOv2 by Meta (★13, updated Jul 21, 2025)
- The official implementation of "Grounded Chain-of-Thought for Multimodal Large Language Models" (★21, updated Jul 21, 2025)
- A repository for organizing papers, code, and other resources related to unified multimodal models (★799, updated Oct 10, 2025)
- A curated list of papers on LLM-based multimodal generation (image, video, 3D, and audio) (★540, updated Apr 4, 2025)
- [AAAI 2026 Oral] The official code of "UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning" (★62, updated Dec 8, 2025)
- ★14, updated May 26, 2023
- MLLM @ Game (★16, updated May 12, 2025)
- Latest open-source "Thinking with images" (O3/O4-mini) papers, covering training-free, SFT-based, and RL-enhanced methods for "fine-grain… (★110, updated Aug 21, 2025)
- ★21, updated Jul 9, 2025
- [MM'2024] Official release of RFUND, introduced in the MM'2024 paper "PEneo: Unifying Line Extraction, Line Grouping, and Entity Linking f… (★20, updated Dec 4, 2024)
- This repository provides a valuable reference for researchers in the field of multimodality; please start your exploratory travel in RL-bas… (★1,350, updated Dec 7, 2025)
- Official repo for "AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability" (★34, updated Jul 12, 2024)
- This repository is related to "Intriguing Properties of Hyperbolic Embeddings in Vision-Language Models", published at TMLR (2024), https… (★22, updated Jul 5, 2024)
- ★25, updated Nov 17, 2025
- Famous Vision Language Models and Their Architectures (★1,178, updated Jan 11, 2026)
- [ICLR 2025] γ-MOD: Mixture-of-Depth Adaptation for Multimodal Large Language Models (★42, updated Oct 28, 2025)
- [ICCV 2025] Enhancing Partially Relevant Video Retrieval with Hyperbolic Learning (★49, updated Dec 10, 2025)
- Awesome_Multimodel is a curated GitHub repository that provides a comprehensive collection of resources for Multimodal Large Language Mod… (★361, updated Mar 19, 2025)
- ★22, updated Dec 11, 2025
- ★360, updated Jan 27, 2024
- Efficient Multimodal Large Language Models: A Survey (★387, updated Apr 29, 2025)
- This repo contains the code for "VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks" [ICLR 2025] (★569, updated this week)
- [Survey] Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey (★477, updated Jan 17, 2025)
- Code for "CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning" (★32, updated Mar 26, 2025)
- Learning Situation Hyper-Graphs for Video Question Answering (★22, updated Feb 16, 2024)
- A Survey on Benchmarks of Multimodal Large Language Models (★148, updated Jul 1, 2025)
- Collection of AWESOME vision-language models for vision tasks (★3,081, updated Oct 14, 2025)
- ★28, updated Feb 2, 2026
- Official Repository for CLRCMD (appeared in ACL 2022) (★43, updated Feb 21, 2023)
- ★12, updated Feb 2, 2024
- F-16 is a powerful video large language model (LLM) that perceives high-frame-rate videos, developed by the Department of Electr… (★34, updated Jul 3, 2025)
- Agentic Keyframe Search for Video Question Answering (★16, updated Apr 7, 2025)