NVlabs / OmniVinci
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
☆628 · Updated 3 months ago
Alternatives and similar repositories for OmniVinci
Users interested in OmniVinci are comparing it to the libraries listed below.
- AudioStory: Generating Long-Form Narrative Audio with Large Language Models ☆295 · Updated 4 months ago
- ☆77 · Updated 8 months ago
- This is the official repo for the paper "LongCat-Flash-Omni Technical Report" ☆460 · Updated last week
- Step3-VL-10B: A compact yet frontier multimodal model achieving SOTA performance at the 10B scale, matching open-source models 10-20x its… ☆290 · Updated last week
- Ming - facilitating advanced multimodal understanding and generation capabilities built upon the Ling LLM. ☆575 · Updated 3 months ago
- video-SALMONN 2 is a powerful audio-visual large language model (LLM) that generates high-quality audio-visual video captions, which is d… ☆145 · Updated last month
- StreamingVLM: Real-Time Understanding for Infinite Video Streams ☆856 · Updated 3 months ago
- NextStep-1: SOTA Autoregressive Image Generation with Continuous Tokens. A research project developed by StepFun’s Multimodal Intellige… ☆598 · Updated last month
- ☆185 · Updated 11 months ago
- The official repo for "Vidi: Large Multimodal Models for Video Understanding and Editing" ☆565 · Updated last week
- MiMo-VL ☆621 · Updated 5 months ago
- [ICLR'26] Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs ☆96 · Updated this week
- LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale (CVPR 2025) ☆371 · Updated 3 months ago
- HunyuanImage-2.1: An Efficient Diffusion Model for High-Resolution (2K) Text-to-Image Generation ☆671 · Updated 3 months ago
- Official repository for "VideoPrism: A Foundational Visual Encoder for Video Understanding" (ICML 2024) ☆349 · Updated 2 weeks ago
- DACVAE ☆189 · Updated last month
- WeDLM: The fastest diffusion language model with standard causal attention and native KV cache compatibility, delivering real speedups ov… ☆588 · Updated 2 weeks ago
- Official implementation of "Continuous Autoregressive Language Models" ☆714 · Updated last month
- A Scientific Multimodal Foundation Model ☆627 · Updated 4 months ago
- ☆491 · Updated last month
- A reproduction of the Deepseek-OCR model including training ☆206 · Updated 2 months ago
- ☆572 · Updated 2 weeks ago
- Tiny Model, Big Logic: Diversity-Driven Optimization Elicits Large-Model Reasoning Ability in VibeThinker-1.5B ☆565 · Updated 2 months ago
- An official implementation of "CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning" ☆178 · Updated last month
- An open-source implementation of Whisper ☆475 · Updated 3 months ago
- LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM ☆294 · Updated 8 months ago
- The official GitHub Page for MiniMax ☆61 · Updated 2 months ago
- VoiceStar: Robust, Duration-controllable TTS that can Extrapolate ☆306 · Updated 7 months ago
- Kyutai with an "eye" ☆235 · Updated 10 months ago
- ☆925 · Updated last week