Sally-SH / VSP-LLM
☆317Updated last month
Alternatives and similar repositories for VSP-LLM:
Users who are interested in VSP-LLM are comparing it to the libraries listed below
- MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation☆382Updated last year
- ☆176Updated 10 months ago
- ☆158Updated 5 months ago
- High-quality Text-to-Audio Generation with Efficient Diffusion Transformer☆268Updated 3 weeks ago
- ☆221Updated last month
- ☆369Updated 2 months ago
- ☆256Updated last year
- A text-conditional diffusion probabilistic model capable of generating high-fidelity audio.☆162Updated 11 months ago
- LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale (CVPR 2025)☆181Updated last week
- 💡 VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning☆182Updated last week
- A lightweight end-to-end text-to-speech model☆113Updated 2 months ago
- Official GPU implementation of the paper "PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance"☆130Updated 5 months ago
- ☆186Updated 9 months ago
- [CVPR 2024] VCoder: Versatile Vision Encoders for Multimodal Large Language Models☆278Updated last year
- Kyutai with an "eye"☆189Updated last month
- FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds. An AI Foley master that adds vivid, synchronized sound effects to your silent videos 😝☆578Updated 9 months ago
- Speech Diarization for scrum automation☆103Updated last year
- A toolkit for speaker diarization.☆185Updated this week
- This is the official repository for M2UGen☆488Updated 4 months ago
- Next-generation TTS model using flow-matching and DiT, inspired by Stable Diffusion 3☆405Updated 7 months ago
- [CVPR2024] Make Your Dream A Vlog☆423Updated last year
- Using Claude Sonnet 3.5 to forward (reverse) engineer code from VASA white paper - WIP - (this is for La Raza 🎷)☆290Updated 5 months ago
- ☆174Updated last year
- ICASSP2024: Adaptive Super Resolution For One-Shot Talking-Head Generation☆180Updated last year
- Official implementation of the paper "MusicInfuser: Making Video Diffusion Listen and Dance"☆68Updated 3 weeks ago
- [Interspeech 2024] Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation☆151Updated 2 months ago
- ☆254Updated this week
- An open source community implementation of the model from the paper: "Movie Gen: A Cast of Media Foundation Models". Join our community …☆60Updated this week
- ☆55Updated 9 months ago
- ☆354Updated 8 months ago