umbertocappellazzo / Omni-AVSR
Official PyTorch implementation of "Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models".
☆28Updated last week
Alternatives and similar repositories for Omni-AVSR
Users who are interested in Omni-AVSR are comparing it to the libraries listed below.
- ☆78Updated 8 months ago
- Official code of the paper: Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis.☆45Updated last year
- Towards Fine-grained Audio Captioning with Multimodal Contextual Cues☆86Updated last week
- EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs☆42Updated 3 months ago
- ☆76Updated 3 months ago
- ☆61Updated 6 months ago
- Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation☆62Updated 6 months ago
- Music production for silent film clips.☆31Updated 8 months ago
- FLM-Audio is an audio-language subversion of RoboEgo/FLM-Ego -- an omnimodal model with native full duplexity.☆55Updated last month
- Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos☆25Updated last year
- ☆62Updated 6 months ago
- The official implementation of OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows☆122Updated 4 months ago
- This repo contains the official PyTorch implementation of AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image …☆87Updated last year
- An official implementation of SwapAnyone.☆72Updated 9 months ago
- A project for tri-modal LLM benchmarking and instruction tuning.☆53Updated 9 months ago
- ☆19Updated last year
- Demo page of TAVGBench: Benchmarking Text to Audible-Video Generation☆14Updated 9 months ago
- ☆47Updated 8 months ago
- A text-conditional diffusion probabilistic model capable of generating high-fidelity audio.☆188Updated last year
- The official code repository for SongPrep: A Preprocessing Framework and End-to-end Model for Full-song Structure Parsing and Lyrics Tran…☆131Updated last month
- ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation☆109Updated last month
- LVAS-Agent Code Base☆21Updated 8 months ago
- The official PyTorch implementation for Improving Long-Text Alignment for Text-to-Image Diffusion Models (LongAlign)☆80Updated 8 months ago
- Anim-400K: A dataset designed from the ground up for automated dubbing of video☆110Updated last year
- ☆20Updated last year
- ☆20Updated 3 years ago
- ☆41Updated 5 months ago
- Official source codes for the paper: EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing.☆32Updated 7 months ago
- video-SALMONN 2 is a powerful audio-visual large language model (LLM) that generates high-quality audio-visual video captions, which is d…☆136Updated 2 weeks ago
- LLaVA combined with the Magvit image tokenizer, training an MLLM without a vision encoder. Unifying image understanding and generation.☆39Updated last year