HumanMLLM / R1-Omni
☆857 Updated last month
Alternatives and similar repositories for R1-Omni:
Users who are interested in R1-Omni are comparing it to the repositories listed below.
- ✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction ☆2,268 Updated last month
- A fork to add multimodal model training to open-r1 ☆1,245 Updated 3 months ago
- Frontier Multimodal Foundation Models for Image and Video Understanding ☆768 Updated 3 weeks ago
- Kimi-VL: Mixture-of-Experts Vision-Language Model for Multimodal Reasoning, Long-Context Understanding, and Strong Agent Capabilities ☆819 Updated 2 weeks ago
- R1-onevision, a visual language model capable of deep CoT reasoning. ☆513 Updated 3 weeks ago
- Explore the Multimodal “Aha Moment” on 2B Model ☆583 Updated last month
- Qwen2.5-Omni is an end-to-end multimodal model by Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, video, and pe… ☆2,867 Updated last week
- MM-EUREKA: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning ☆590 Updated last week
- [CVPR 2025] Open-source, End-to-end, Vision-Language-Action model for GUI Agent & Computer Use. ☆1,227 Updated last month
- ☆225 Updated 2 months ago
- Scalable RL solution for advanced reasoning of language models ☆1,529 Updated last month
- An open-sourced end-to-end VLM-based GUI Agent ☆936 Updated last month
- ☆739 Updated 2 weeks ago
- HumanOmni ☆158 Updated 2 months ago
- Video-R1: Reinforcing Video Reasoning in MLLMs [🔥the first paper to explore R1 for video] ☆489 Updated last week
- A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings. ☆900 Updated last month
- An Open Large Reasoning Model for Real-World Solutions ☆1,488 Updated 2 months ago
- EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework based on veRL ☆2,258 Updated last week
- Muon is Scalable for LLM Training ☆1,039 Updated last month
- Codebase for Aria - an Open Multimodal Native MoE ☆1,033 Updated 3 months ago
- 💡 VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning ☆182 Updated 2 weeks ago
- The official repo of Qwen2-Audio chat & pretrained large audio language model proposed by Alibaba Cloud. ☆1,718 Updated 2 weeks ago
- Repo for Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent ☆310 Updated 2 weeks ago
- LLaVA-CoT, a visual language model capable of spontaneous, systematic reasoning ☆1,980 Updated 3 weeks ago
- [CVPR 2025] Magma: A Foundation Model for Multimodal AI Agents ☆1,623 Updated this week
- Implementation for Describe Anything: Detailed Localized Image and Video Captioning ☆908 Updated last week
- Next-Token Prediction is All You Need ☆2,111 Updated last month
- GPT-4o-level, real-time spoken dialogue system. ☆321 Updated 3 months ago
- Parsing-free RAG supported by VLMs ☆695 Updated 2 months ago
- "VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos" ☆633 Updated last month