MOVA: Towards Scalable and Synchronized Video–Audio Generation
☆740 Feb 19, 2026 Updated last week
Alternatives and similar repositories for MOVA
Users who are interested in MOVA are comparing it to the repositories listed below.
- MOSS-Speech is a true speech-to-speech large language model without text guidance. ☆124 Feb 13, 2026 Updated 2 weeks ago
- Ming-omni-tts: Simple and Efficient Unified Generation of Speech, Music, and Sound with Precise Control ☆80 Updated this week
- We Speech Toolkit, LLM based Speech Toolkit for Speech Understanding, Generation, and Interaction ☆179 Feb 3, 2026 Updated 3 weeks ago
- KsanaDiT: High-Performance DiT (Diffusion Transformer) Inference Framework for Video & Image Generation ☆36 Feb 6, 2026 Updated 3 weeks ago
- An instruct text-to-speech solution based on LLaSA and CosyVoice2 developed by the ASLP lab and collaborators. ☆220 Jan 20, 2026 Updated last month
- ☆10 Feb 14, 2025 Updated last year
- The demo page for ALMTokenizer ☆59 Apr 14, 2025 Updated 10 months ago
- ☆49 Feb 12, 2026 Updated 2 weeks ago
- ☆15 Sep 23, 2024 Updated last year
- Open-Ended Speaking Style Modeling via Fine-Grained and Multi-Granular Contrastive Language-Speech Pre-training ☆64 Feb 7, 2026 Updated 3 weeks ago
- Code for 'JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion' ☆199 Feb 10, 2026 Updated 2 weeks ago
- Official Python inference and LoRA trainer package for the LTX-2 audio–video generative model. ☆4,053 Feb 9, 2026 Updated 2 weeks ago
- Lynx: Towards High-Fidelity Personalized Video Generation ☆309 Sep 26, 2025 Updated 5 months ago
- Unofficial PyTorch implementation of "Autoregressive Speech Synthesis without Vector Quantization (MELLE)" ☆41 Jun 28, 2025 Updated 8 months ago
- Streaming Vocos ☆30 Jun 10, 2025 Updated 8 months ago
- PICABench: How Far Are We from Physically Realistic Image Editing? ☆35 Nov 5, 2025 Updated 3 months ago
- Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation ☆432 Nov 27, 2025 Updated 3 months ago
- The first spoken long-text dataset derived from live streams, designed to reflect the redundancy-rich and conversational nature of real-w… ☆12 Jun 28, 2025 Updated 7 months ago
- Repository for "TESS-2: A Large-Scale, Generalist Diffusion Language Model" ☆54 Feb 20, 2025 Updated last year
- ☆77 Sep 25, 2025 Updated 5 months ago
- [ICML 2025 Tokenization Workshop] HH-Codec: High Compression High-fidelity Discrete Neural Codec for Spoken Language Modeling ☆78 Sep 28, 2025 Updated 5 months ago
- Here we will track the latest Audio AI Agent, including speech, music, sound effects, etc. ☆16 Dec 8, 2023 Updated 2 years ago
- MOSS-TTSD is a spoken dialogue generation model designed for expressive multi-speaker synthesis. It features long-context modeling, flex… ☆1,180 Feb 17, 2026 Updated last week
- The official implementation of the DIFFA series for dLLM-based large audio language model ☆59 Feb 2, 2026 Updated 3 weeks ago
- [NeurIPS' 25] Benchmark for evaluating TTS models on complex prosodic, expressiveness, and linguistic challenges. ☆189 Dec 9, 2025 Updated 2 months ago
- [ICLR 2026] UniVideo: Unified Understanding, Generation, and Editing for Videos ☆431 Feb 11, 2026 Updated 2 weeks ago
- Some scripts to help with using Montreal Forced Aligner, mainly for transforming Hanzi characters to pinyin and extracting pause times from .textg… ☆14 Feb 9, 2024 Updated 2 years ago
- MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows ☆123 Sep 2, 2025 Updated 5 months ago
- Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders ☆208 Feb 13, 2026 Updated 2 weeks ago
- Pose Extraction & Rendering for SCAIL: Towards Studio-Grade Character Animation via In-Context Learning of 3D-Consistent Pose Representat… ☆178 Dec 28, 2025 Updated 2 months ago
- ☆2,127 Dec 20, 2025 Updated 2 months ago
- LongCat Audio Tokenizer and Detokenizer ☆284 Updated this week
- SoulX-FlashTalk is the first 14B model to achieve sub-second start-up latency (0.87s) while maintaining a real-time throughput of 32 FPS … ☆992 Feb 12, 2026 Updated 2 weeks ago
- SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation ☆581 Dec 23, 2025 Updated 2 months ago
- poorman's ar-dit tts ☆45 Dec 31, 2025 Updated 2 months ago
- A unified tokenizer that is capable of both extracting semantic information and enabling high-fidelity audio reconstruction. ☆134 Sep 19, 2025 Updated 5 months ago
- An Open-Source Project to Unify Audio Processing and Generation ☆207 Jan 29, 2026 Updated last month
- ☆15 Sep 20, 2023 Updated 2 years ago
- A package for NeuCodec: a 50hz, 0.8kbps, 24kHz audio codec. ☆153 Jan 27, 2026 Updated last month