NVlabs / OmniVinci
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
☆342 Updated this week
Alternatives and similar repositories for OmniVinci
Users who are interested in OmniVinci are comparing it to the repositories listed below.
- NEO Series: Native Vision-Language Models from First Principles ☆180 Updated last week
- StreamingVLM: Real-Time Understanding for Infinite Video Streams ☆577 Updated 2 weeks ago
- AudioStory: Generating Long-Form Narrative Audio with Large Language Models ☆284 Updated last month
- ☆78 Updated 5 months ago
- Kyutai with an "eye" ☆222 Updated 7 months ago
- An open-source implementation of Whisper ☆451 Updated 3 weeks ago
- Liquid Audio - Speech-to-Speech audio models by Liquid AI ☆206 Updated last month
- The official GitHub Page for MiniMax ☆57 Updated 3 months ago
- The official repository of "R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Integration" ☆119 Updated last month
- An official implementation of "CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning" ☆106 Updated last week
- Official PyTorch implementation of TokenSet. ☆126 Updated 7 months ago
- VoiceStar: Robust, Duration-controllable TTS that can Extrapolate ☆292 Updated 4 months ago
- ☆55 Updated 11 months ago
- HunyuanImage-2.1: An Efficient Diffusion Model for High-Resolution (2K) Text-to-Image Generation ☆652 Updated 2 weeks ago
- Official implementation of "Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs". ☆51 Updated last week
- The official repo for "Parallel-R1: Towards Parallel Thinking via Reinforcement Learning" ☆229 Updated last week
- Official repository for "VideoPrism: A Foundational Visual Encoder for Video Understanding" (ICML 2024) ☆317 Updated 3 weeks ago
- QeRL enables RL for 32B LLMs on a single H100 GPU. ☆361 Updated 2 weeks ago
- ☆19 Updated 7 months ago
- ☆163 Updated 3 months ago
- The open-source code of MetaStone-S1. ☆107 Updated 2 months ago
- [ACL2025 Oral & Award] Evaluate Image/Video Generation like Humans - Fast, Explainable, Flexible ☆104 Updated 2 months ago
- ☆561 Updated last week
- The code repository of the paper: Competition and Attraction Improve Model Fusion ☆161 Updated 2 months ago
- video-SALMONN 2 is a powerful audio-visual large language model (LLM) that generates high-quality audio-visual video captions, which is d… ☆98 Updated last week
- A Scientific Multimodal Foundation Model ☆587 Updated last month
- ☆89 Updated last year
- ☆87 Updated 5 months ago
- ☆300 Updated 2 months ago
- LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM ☆287 Updated 5 months ago