HarryHsing / EchoInk

EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning [🔥The Exploration of R1 for General Audio-Visual Reasoning with Qwen2.5-Omni]

☆60

Alternatives and similar repositories for EchoInk

Users who are interested in EchoInk are comparing it to the repositories listed below

threegold116 / Awesome-Omni-MLLMs
This is for ACL 2025 Findings Paper: From Specific-MLLMs to Omni-MLLMs: A Survey on MLLMs Aligned with Multi-modalities
☆60 Updated last month
RainBowLuoCS / OpenOmni
(NIPS 2025) OpenOmni: Official implementation of Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Align…
☆107 Updated last month
emova-ollm / EMOVA
Official PyTorch implementation of EMOVA in CVPR 2025 (https://arxiv.org/abs/2409.18042)
☆74 Updated 7 months ago
multimodal-art-projection / OmniBench
A project for tri-modal LLM benchmarking and instruction tuning.
☆48 Updated 7 months ago
BriansIDP / video-SALMONN-o1
☆35 Updated 2 months ago
HumanMLLM / Omni-Emotion
☆21 Updated 9 months ago
JaaackHongggg / WorldSense
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
☆31 Updated last month
GeWu-Lab / Crab
[CVPR 2025] Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
☆73 Updated 4 months ago
BriansIDP / AudioVisualLLM
☆19 Updated last year
rikeilong / Bay-CAT
[ECCV’24] Official Implementation for CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenario…
☆57 Updated last year
OmniMMI / OpenOmniNexus
A fully open-source implementation of a GPT-4o-like speech-to-speech video understanding model.
☆27 Updated 6 months ago
scofield7419 / Video-of-Thought
Video Chain of Thought, Codes for ICML 2024 paper: "Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition"
☆168 Updated 8 months ago
FrankYang-17 / MME-VideoOCR
☆34 Updated 5 months ago
ttgeng233 / LongVALE
LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos. (CVPR 2025)
☆51 Updated 4 months ago
AV-Reasoner / AV-Reasoner
☆17 Updated 3 months ago
lzw-lzw / UnifiedMLLM
UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model
☆22 Updated last year
360CVGroup / Inner-Adaptor-Architecture
LMM solved catastrophic forgetting, AAAI 2025
☆44 Updated 6 months ago
schowdhury671 / meerkat
☆33 Updated 3 months ago
bronyayang / HallE_Control
HallE-Control: Controlling Object Hallucination in LMMs
☆31 Updated last year
yaolinli / DeCo
Code for DeCo: Decoupling token compression from semantic abstraction in multimodal large language models
☆74 Updated 3 months ago
MC-EIU / MC-EIU
☆24 Updated 6 months ago
Kevinz-code / SeVa
[MM2024, oral] "Self-Supervised Visual Preference Alignment" https://arxiv.org/abs/2404.10501
☆57 Updated last year
AV-Odyssey / AV-Odyssey
This repo contains evaluation code for the paper "AV-Odyssey: Can Your Multimodal LLMs Really Understand Audio-Visual Information?"
☆30 Updated 10 months ago
invictus717 / MiCo
[ICCV 2025] Explore the Limits of Omni-modal Pretraining at Scale
☆118 Updated last year
GiantAILab / DeepDubber-V1
DeepDubber-V1: Towards High Quality and Dialogue, Narration, Monologue Adaptive Movie Dubbing Via Multi-Modal Chain-of-Thoughts Reasoning…
☆25 Updated last month
zhuyjan / MER2025-MRAC25
[ACM-MM 2025 Workshop] More Is Better: A MoE-Based Emotion Recognition Framework with Human Preference Alignment.
☆24 Updated last month
yyysjz1997 / Awesome-AudioVision-Multimodal
A list of current Audio-Vision Multimodal with awesome resources (paper, application, data, review, survey, etc.).
☆27 Updated 2 years ago
Ceaglex / LoVA
The code and weight for LoVA. LoVA is a novel model for Long-form Video-to-Audio generation. Based on the Diffusion Transformer (DiT) arc…
☆15 Updated 8 months ago
JiuTian-VL / MoME
[NeurIPS 2024] MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models
☆72 Updated 5 months ago
MCR-PEFT / C-MCR
☆43 Updated 5 months ago