EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning [🔥 The Exploration of R1 for General Audio-Visual Reasoning with Qwen2.5-Omni]
☆74 · May 18, 2025 · Updated 9 months ago
Alternatives and similar repositories for EchoInk
Users that are interested in EchoInk are comparing it to the libraries listed below
- ICASSP2026 HumDial Challenge ☆36 · Dec 13, 2025 · Updated 2 months ago
- [ASRU 2025] Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM? ☆43 · Nov 21, 2025 · Updated 3 months ago
- Adaptive Multimodal Reasoning via Reinforcement Learning ☆23 · Jan 11, 2026 · Updated last month
- ☆17 · Mar 26, 2021 · Updated 4 years ago
- Implementation and experiment of the MusGConv paper. ☆15 · Sep 6, 2024 · Updated last year
- ☆17 · May 5, 2024 · Updated last year
- This is for ACL 2025 Findings Paper: From Specific-MLLMs to Omni-MLLMs: A Survey on MLLMs Aligned with Multi-modalitiesModels ☆92 · Jan 3, 2026 · Updated 2 months ago
- Official PyTorch implementation of RACRO (https://www.arxiv.org/abs/2506.04559) ☆19 · Jul 1, 2025 · Updated 8 months ago
- ☆19 · Sep 1, 2025 · Updated 6 months ago
- Colab notebook for fine-tuning Qwen2-Audio with trl's SFT and PPO trainers. ☆24 · Nov 23, 2024 · Updated last year
- This is a repository for fine-tuning Qwen2-Audio, currently supporting Distributed Data Parallel (DDP) and DeepSpeed. ☆49 · Jul 28, 2025 · Updated 7 months ago
- wav2vec2 audio classification for prosodic boundary detection and other tasks ☆42 · Aug 11, 2023 · Updated 2 years ago
- Code for paper "Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition" ☆19 · Jun 21, 2023 · Updated 2 years ago
- LUCY: Linguistic Understanding and Control Yielding Early Stage of Her ☆59 · Apr 14, 2025 · Updated 10 months ago
- [NeurIPS2024] Official code for (IMA) Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs ☆23 · Oct 15, 2024 · Updated last year
- Keras Implementation of "Look, Listen and Learn" Model ☆21 · Nov 14, 2017 · Updated 8 years ago
- SFT+RL boosts multimodal reasoning ☆46 · Jun 27, 2025 · Updated 8 months ago
- Code for the paper: "Leveraging speaker attribute information using multi task learning for speaker verification and diarization" present… ☆26 · Oct 5, 2022 · Updated 3 years ago
- Recent Advances in Visual Dialog ☆30 · Aug 19, 2022 · Updated 3 years ago
- [CVPR2026] VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice ☆65 · Feb 27, 2026 · Updated last week
- A Comprehensive Survey on Evaluating Reasoning Capabilities in Multimodal Large Language Models. ☆73 · Mar 18, 2025 · Updated 11 months ago
- Machine learning speaker characteristics ☆42 · Feb 26, 2026 · Updated last week
- Awesome-Representation-Learning-CV-PaperAndCodes, latest development in the representation learning area. ☆33 · Apr 24, 2023 · Updated 2 years ago
- Code for the paper: GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities ☆153 · Dec 5, 2024 · Updated last year
- Non-parallel voice conversion called ICRCycleGAN-VC based on CycleGAN and Inception-resNet module by Afiuny ☆15 · Oct 30, 2025 · Updated 4 months ago
- The repository of VG-Refiner paper ☆17 · Dec 9, 2025 · Updated 2 months ago
- ☆36 · Jul 9, 2025 · Updated 7 months ago
- [NeurIPS 2025] Benchmark data and code for MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix ☆197 · Feb 25, 2026 · Updated last week
- Frequency tracking in time-frequency representations ☆13 · Jan 19, 2021 · Updated 5 years ago
- Building a multi-agent RAG system with advanced RAG methods ☆12 · Jan 12, 2025 · Updated last year
- ☆10 · Dec 8, 2025 · Updated 2 months ago
- This branch of Asteroid contains code for the vocal harmony and chamber ensemble separation related papers. ☆12 · Nov 7, 2024 · Updated last year
- Official implementation of CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation ☆12 · Dec 5, 2025 · Updated 3 months ago
- A simple exam generator and grader written in Python with OpenCV ☆14 · Jan 14, 2026 · Updated last month
- ☆10 · Oct 20, 2022 · Updated 3 years ago
- [NeurIPS 2025] HoliTom: Holistic Token Merging for Fast Video Large Language Models ☆71 · Oct 10, 2025 · Updated 4 months ago
- Code for paper "Compositional Text-to-Image Synthesis with Attention Map Control of Diffusion Models" ☆46 · Sep 21, 2023 · Updated 2 years ago
- [WACV'25 Oral] Enhancing Zero-Shot Facial Expression Recognition by LLM Knowledge Transfer ☆56 · Feb 25, 2025 · Updated last year
- ☆133 · Jan 24, 2026 · Updated last month