threegold116 / Awesome-Omni-MLLMs
This is the repository for the ACL 2025 Findings paper: From Specific-MLLMs to Omni-MLLMs: A Survey on MLLMs Aligned with Multi-modalities
☆90 · Jan 3, 2026 · Updated last month
Alternatives and similar repositories for Awesome-Omni-MLLMs
Users interested in Awesome-Omni-MLLMs are comparing it to the libraries listed below.
- [NeurIPS 2024] MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models ☆79 · Dec 27, 2025 · Updated last month
- KDD 2024 AQA competition 2nd place solution ☆12 · Jul 21, 2024 · Updated last year
- EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning [🔥The Exploration of R1 for General Audio-Vi… ☆74 · May 18, 2025 · Updated 8 months ago
- Official repository for "Boosting Audio Visual Question Answering via Key Semantic-Aware Cues" in ACM MM 2024. ☆16 · Oct 25, 2024 · Updated last year
- LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos. (CVPR 2025) ☆56 · Jun 9, 2025 · Updated 8 months ago
- ☆185 · Feb 8, 2025 · Updated last year
- The implementation for "Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions" ☆50 · Apr 7, 2025 · Updated 10 months ago
- This repository contains code for the AAAI 2025 paper "Dense Audio-Visual Event Localization under Cross-Modal Consistency and Multi-Temporal … ☆22 · Aug 18, 2025 · Updated 5 months ago
- Survey on speech generation work. ☆21 · Nov 26, 2023 · Updated 2 years ago
- Web application for real-time object detection 🔎 using Flask 🌶, OpenCV, and YOLOv3 weights. It uses the COCO dataset 🖼. ☆16 · Apr 19, 2021 · Updated 4 years ago
- Awesome papers on multi-modal LLMs with grounding ability ☆19 · Oct 11, 2025 · Updated 4 months ago
- ☆20 · Jan 6, 2023 · Updated 3 years ago
- Baichuan-Omni: Towards Capable Open-source Omni-modal LLM 🌊 ☆272 · Jan 27, 2025 · Updated last year
- Official implementation of "OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging". ☆43 · Oct 30, 2025 · Updated 3 months ago
- This repo contains evaluation code for the paper "AV-Odyssey: Can Your Multimodal LLMs Really Understand Audio-Visual Information?" ☆31 · Dec 23, 2024 · Updated last year
- [ECCV'24] Official Implementation for CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenario… ☆58 · Sep 4, 2024 · Updated last year
- [ICLR 2026] Data Pipeline, Models, and Benchmark for Omni-Captioner. ☆119 · Oct 17, 2025 · Updated 4 months ago
- SFT+RL boosts multimodal reasoning ☆46 · Jun 27, 2025 · Updated 7 months ago
- Towards Fine-grained Audio Captioning with Multimodal Contextual Cues ☆86 · Jan 4, 2026 · Updated last month
- MIO: A Foundation Model on Multimodal Tokens ☆33 · Dec 13, 2024 · Updated last year
- Reproduction of the complete DeepSeek-R1 pipeline on small-scale models, including pre-training, SFT, and RL. ☆29 · Mar 11, 2025 · Updated 11 months ago
- A unified framework for controllable caption generation across images, videos, and audio. Supports multi-modal inputs and customizable ca… ☆52 · Jul 24, 2025 · Updated 6 months ago
- DEYOv1.5 ☆29 · Jul 22, 2024 · Updated last year
- Small audio language model for reasoning ☆86 · Dec 4, 2025 · Updated 2 months ago
- https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT ☆123 · Jan 30, 2026 · Updated 2 weeks ago
- Fine-tune models such as ssd_mobilenet and faster_rcnn on the COCO dataset ☆27 · Sep 9, 2020 · Updated 5 years ago
- Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey ☆957 · Nov 14, 2025 · Updated 3 months ago
- A Comprehensive Survey on Evaluating Reasoning Capabilities in Multimodal Large Language Models. ☆71 · Mar 18, 2025 · Updated 11 months ago
- [ICLR 2024] Sign2GPT: Leveraging Large Language Models for Gloss-Free Sign Language Translation ☆36 · Jun 30, 2025 · Updated 7 months ago
- 🔥 An open-source survey of the latest video reasoning tasks, paradigms, and benchmarks. ☆145 · Jan 16, 2026 · Updated last month
- Our 2nd-gen LMM ☆34 · May 22, 2024 · Updated last year
- Thinking with Videos from Open-Source Priors. We reproduce chain-of-frames visual reasoning by fine-tuning open-source video models. Give… ☆209 · Oct 12, 2025 · Updated 4 months ago
- ☆262 · May 19, 2025 · Updated 8 months ago
- [CVPR'25] 🌟🌟 EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering ☆46 · Jun 19, 2025 · Updated 7 months ago
- [NeurIPS'25 Spotlight] Official implementation of "JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation" ☆70 · Jan 10, 2026 · Updated last month
- This is the official repo for the paper "LongCat-Flash-Omni Technical Report" ☆477 · Feb 10, 2026 · Updated last week
- Open-vocabulary Semantic Segmentation ☆33 · Feb 16, 2024 · Updated 2 years ago
- Fast LLM training codebase with dynamic strategy selection (DeepSpeed + Megatron + FlashAttention + CUDA fusion kernels + compiler) ☆40 · Jan 4, 2024 · Updated 2 years ago
- MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment ☆35 · Jul 1, 2024 · Updated last year