NVlabs / OmniVinci
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
☆603 · Updated last month
Alternatives and similar repositories for OmniVinci
Users who are interested in OmniVinci are comparing it to the libraries listed below.
- AudioStory: Generating Long-Form Narrative Audio with Large Language Models ☆291 · Updated 2 months ago
- ☆78 · Updated 7 months ago
- This is the official repo for the paper "LongCat-Flash-Omni Technical Report" ☆436 · Updated this week
- StreamingVLM: Real-Time Understanding for Infinite Video Streams ☆760 · Updated 2 months ago
- An open-source implementation of Whisper ☆469 · Updated last month
- Ming - facilitating advanced multimodal understanding and generation capabilities built upon the Ling LLM. ☆555 · Updated last month
- A Scientific Multimodal Foundation Model ☆618 · Updated 2 months ago
- ☆574 · Updated last month
- Fully Open Framework for Democratized Multimodal Training ☆655 · Updated this week
- HunyuanImage-2.1: An Efficient Diffusion Model for High-Resolution (2K) Text-to-Image Generation ☆663 · Updated 2 months ago
- Official repository for "VideoPrism: A Foundational Visual Encoder for Video Understanding" (ICML 2024) ☆329 · Updated 2 months ago
- ☆183 · Updated 10 months ago
- MiMo-VL ☆596 · Updated 3 months ago
- ☆437 · Updated 3 weeks ago
- Liquid Audio - Speech-to-Speech audio models by Liquid AI ☆289 · Updated 2 months ago
- Official implementation of "Continuous Autoregressive Language Models" ☆673 · Updated 2 weeks ago
- ☆408 · Updated 3 weeks ago
- Official implementation of "Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs". ☆94 · Updated last month
- Kyutai with an "eye" ☆230 · Updated 8 months ago
- An official implementation of "CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning" ☆153 · Updated last month
- video-SALMONN 2 is a powerful audio-visual large language model (LLM) that generates high-quality audio-visual video captions, which is d… ☆128 · Updated last month
- LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale (CVPR 2025) ☆319 · Updated last month
- The official repo for "Vidi: Large Multimodal Models for Video Understanding and Editing" ☆530 · Updated this week
- The official GitHub Page for MiniMax ☆60 · Updated last month
- Native Multimodal Models are World Learners ☆1,342 · Updated 2 weeks ago
- Tiny Model, Big Logic: Diversity-Driven Optimization Elicits Large-Model Reasoning Ability in VibeThinker-1.5B ☆532 · Updated 3 weeks ago
- VoiceStar: Robust, Duration-controllable TTS that can Extrapolate ☆298 · Updated 6 months ago
- ☆844 · Updated 3 months ago
- A reproduction of the Deepseek-OCR model including training ☆196 · Updated 3 weeks ago
- ☆249 · Updated 6 months ago