NVlabs / OmniVinci
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
☆633 Oct 29, 2025 Updated 3 months ago
Alternatives and similar repositories for OmniVinci
Users who are interested in OmniVinci are comparing it to the repositories listed below.
- ☆32 Nov 18, 2025 Updated 2 months ago
- A UI designer for constructing AI applications with OpenSearch ☆16 Updated this week
- [ASRU 2025] Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM? ☆42 Nov 21, 2025 Updated 2 months ago
- EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs ☆46 Sep 19, 2025 Updated 4 months ago
- Qwen3-omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, im… ☆3,429 Jan 8, 2026 Updated last month
- PICABench: How Far Are We from Physically Realistic Image Editing? ☆35 Nov 5, 2025 Updated 3 months ago
- ☆16 Mar 26, 2025 Updated 10 months ago
- Spatial Aptitude Training for Multimodal Language Models ☆24 Feb 8, 2026 Updated last week
- We Speech Toolkit, LLM based Speech Toolkit for Speech Understanding, Generation, and Interaction ☆179 Feb 3, 2026 Updated 2 weeks ago
- OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models ☆54 Feb 1, 2026 Updated 2 weeks ago
- The demo page for ALMTokenizer ☆58 Apr 14, 2025 Updated 10 months ago
- Official PyTorch inference code for the Interspeech 2025 paper: Efficient Speech Enhancement via Embeddings from Pre-trained Generative A… ☆75 Jun 16, 2025 Updated 8 months ago
- FlashCosyVoice: A lightweight vLLM implementation built from scratch for CosyVoice. ☆242 Nov 11, 2025 Updated 3 months ago
- ☆59 Jan 26, 2026 Updated 3 weeks ago
- OpenFLAM: Framewise Language Audio Model ☆88 Jan 14, 2026 Updated last month
- A unified tokenizer that is capable of both extracting semantic information and enabling high-fidelity audio reconstruction. ☆132 Sep 19, 2025 Updated 4 months ago
- Ming - facilitating advanced multimodal understanding and generation capabilities built upon the Ling LLM. ☆612 Updated this week
- This is the code for paper: XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs ☆85 Sep 19, 2025 Updated 4 months ago
- UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios ☆113 Dec 17, 2025 Updated 2 months ago
- Official Repository for "Glyph: Scaling Context Windows via Visual-Text Compression" ☆561 Nov 4, 2025 Updated 3 months ago
- ☆26 Updated this week
- "Your Fully-Automated Personal AI Assistant" ☆49 Oct 16, 2025 Updated 4 months ago
- Towards Comprehensive Evaluation for End-to-End Spoken Dialogue Models ☆50 Sep 2, 2025 Updated 5 months ago
- Official implementation of the paper Locality in Image Diffusion Models Emerges from Data Statistics ☆38 Dec 25, 2025 Updated last month
- MMaDA - Open-Sourced Multimodal Large Diffusion Language Models ☆1,574 Nov 16, 2025 Updated 3 months ago
- Ultra-low-bitrate Speech Codec for Speech Language Modeling Applications ☆86 Dec 20, 2024 Updated last year
- This repository collects and organises state‑of‑the‑art papers on spatial reasoning for Multimodal Vision–Language Models (MVLMs). ☆278 Feb 10, 2026 Updated last week
- [ICLR 2026] TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching ☆835 Jan 28, 2026 Updated 2 weeks ago
- A MCP Task Server ☆10 Mar 7, 2025 Updated 11 months ago
- ☆17 Aug 5, 2025 Updated 6 months ago
- A collection of all our phonemizers for dataset construction and inference ☆27 Feb 21, 2025 Updated 11 months ago
- 🔥 [ICLR 2025] Official PyTorch Model "Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark" ☆26 Feb 9, 2025 Updated last year
- [ICLR'26] Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs ☆97 Jan 26, 2026 Updated 3 weeks ago
- 📚 Collection of token-level model compression resources. ☆190 Sep 3, 2025 Updated 5 months ago
- MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows ☆123 Sep 2, 2025 Updated 5 months ago
- Fun-Audio-Chat is a Large Audio Language Model built for natural, low-latency voice interactions. ☆851 Jan 29, 2026 Updated 2 weeks ago
- ☆246 Dec 21, 2025 Updated last month
- [ICLR 2026] LongLive: Real-time Interactive Long Video Generation ☆1,054 Updated this week
- RePlan: Reasoning-Guided Region Planning for Complex Instruction-Based Image Editing ☆58 Dec 26, 2025 Updated last month