OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
☆637Feb 26, 2026Updated last week
Alternatives and similar repositories for OmniVinci
Users who are interested in OmniVinci are comparing it to the libraries listed below.
- A UI designer for constructing AI applications with OpenSearch☆16Feb 26, 2026Updated last week
- ☆33Nov 18, 2025Updated 3 months ago
- [ASRU 2025] Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?☆43Nov 21, 2025Updated 3 months ago
- EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs☆46Sep 19, 2025Updated 5 months ago
- Qwen3-omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, im…☆3,488Jan 8, 2026Updated 2 months ago
- PICABench: How Far Are We from Physically Realistic Image Editing?☆36Nov 5, 2025Updated 4 months ago
- ☆16Mar 26, 2025Updated 11 months ago
- Spatial Aptitude Training for Multimodal Language Models☆24Feb 8, 2026Updated last month
- We Speech Toolkit, LLM-based Speech Toolkit for Speech Understanding, Generation, and Interaction☆180Mar 2, 2026Updated last week
- OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models☆56Feb 1, 2026Updated last month
- The demo page for ALMTokenizer☆59Apr 14, 2025Updated 10 months ago
- FlashCosyVoice: A lightweight vLLM implementation built from scratch for CosyVoice.☆242Feb 25, 2026Updated last week
- ☆63Jan 26, 2026Updated last month
- A unified tokenizer that is capable of both extracting semantic information and enabling high-fidelity audio reconstruction.☆134Sep 19, 2025Updated 5 months ago
- Ming - facilitating advanced multimodal understanding and generation capabilities built upon the Ling LLM.☆636Feb 12, 2026Updated 3 weeks ago
- This is the code for paper: XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs☆90Sep 19, 2025Updated 5 months ago
- ☆63Jul 11, 2025Updated 7 months ago
- OpenFLAM: Framewise Language Audio Model☆100Jan 14, 2026Updated last month
- Official Repository for "Glyph: Scaling Context Windows via Visual-Text Compression"☆561Nov 4, 2025Updated 4 months ago
- UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios☆119Dec 17, 2025Updated 2 months ago
- Towards Comprehensive Evaluation for End-to-End Spoken Dialogue Models☆50Sep 2, 2025Updated 6 months ago
- "Your Fully-Automated Personal AI Assistant"☆49Oct 16, 2025Updated 4 months ago
- Official implementation of the paper Locality in Image Diffusion Models Emerges from Data Statistics☆39Dec 25, 2025Updated 2 months ago
- MMaDA - Open-Sourced Multimodal Large Diffusion Language Models (dLLMs with block diffusion, mixed-CoT, unified RL)☆1,591Feb 14, 2026Updated 3 weeks ago
- Ultra-low-bitrate Speech Codec for Speech Language Modeling Applications☆88Dec 20, 2024Updated last year
- [ICLR 2026] LongLive: Real-time Interactive Long Video Generation☆1,091Feb 26, 2026Updated last week
- This repository collects and organises state‑of‑the‑art papers on spatial reasoning for Multimodal Vision–Language Models (MVLMs).☆283Feb 17, 2026Updated 2 weeks ago
- Implementation of "Look, Listen and Recognise: character-aware audio-visual subtitling"☆19Nov 3, 2025Updated 4 months ago
- ☆17Aug 5, 2025Updated 7 months ago
- A MCP Task Server☆11Mar 7, 2025Updated last year
- ☆56Updated this week
- 🔥 [ICLR 2025] Official PyTorch Model "Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark"☆26Feb 9, 2025Updated last year
- [ICLR'26] Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs☆98Jan 26, 2026Updated last month
- 📚 Collection of token-level model compression resources.☆193Sep 3, 2025Updated 6 months ago
- A collection of all our phonemizers for dataset construction and inference☆28Feb 21, 2025Updated last year
- RePlan: Reasoning-Guided Region Planning for Complex Instruction-Based Image Editing☆58Dec 26, 2025Updated 2 months ago
- [ICCV 2025] Code & Data for: SuperEdit - Rectifying and Facilitating Supervision for Instruction-Based Image Editing☆164Jun 26, 2025Updated 8 months ago
- ☆246Dec 21, 2025Updated 2 months ago
- A Framework for Speech, Language, Audio, Music Processing with Large Language Model☆995Jan 15, 2026Updated last month