Xiaohao-Liu/Awesome-Vison2Audio

A curated list of Vision (video/image) to Audio Generation

☆107

Alternatives and similar repositories for Awesome-Vison2Audio

Users that are interested in Awesome-Vison2Audio are comparing it to the libraries listed below.

cyanbx / Frieren-V2A
Implementation of Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching (NeurIPS'24)
☆63 · Apr 3, 2025 · Updated last year
Xiaohao-Liu / ModalBed
[MM 2025] Towards Modality Generalization: A Benchmark and Prospective Analysis
☆31 · May 22, 2025 · Updated last year
jnwnlee / selva
[CVPR 2026] Official PyTorch implementation of SelVA "Hear What Matters! Text-conditioned Selective Video-to-Audio Generation"
☆15 · Mar 27, 2026 · Updated 4 months ago
Xiaohao-Liu / EliMRec
Implementation of the paper "EliMRec: Eliminating single-modal bias in multimedia recommendation", MM'22.
☆24 · Dec 7, 2023 · Updated 2 years ago
Xiaohao-Liu / Bundle-MLLM
[KDD 2025] Fine-tuning Multimodal Large Language Models for Product Bundling
☆16 · Sep 20, 2025 · Updated 10 months ago
xiquan-li / Awesome-Audio-Generation
Curated list of papers, code, and resources related to Text-to-Audio (TTA) Generation
☆75 · Jul 20, 2026 · Updated last week
Xiaohao-Liu / CMCL
[NeurIPS 2025] Continual Multimodal Contrastive Learning
☆31 · Dec 18, 2025 · Updated 7 months ago
Xiaohao-Liu / BundleGT
Implementation of the paper "Strategy-aware Bundle Recommender System", SIGIR'23.
☆17 · Sep 4, 2023 · Updated 2 years ago
wsntxxn / UniFlow-Audio
☆74 · Jul 17, 2026 · Updated last week
hkchengrex / av-benchmark
Benchmarking for Audio-Text and Audio-Visual Generation; Supports FAD, FD_VGG, FD_PANNs, FD_PaSST, IS_PaSST, IS_PANNs, KL_PaSST, KL_PANNs…
☆80 · Feb 14, 2026 · Updated 5 months ago
DragonLiu1995 / video-to-audio-through-text
[NeurIPS 2024] Code, Dataset, Samples for the VATT paper “Tell What You Hear From What You See - Video to Audio Generation Through Text”
☆38 · Jul 24, 2025 · Updated last year
zxxwxyyy / sonique
Video Background Music Generation Using Unpaired Audio-Visual Data
☆33 · Oct 8, 2024 · Updated last year
lmxue / Audio-FLAN
Audio-FLAN
☆161 · Sep 23, 2025 · Updated 10 months ago
chouliuzuo / GVMGen
☆32 · Nov 10, 2025 · Updated 8 months ago
Shy-98 / MELLE
Unofficial PyTorch implementation of "Autoregressive Speech Synthesis without Vector Quantization (MELLE)"
☆41 · Jun 28, 2025 · Updated last year
ETH-DISCO / sao-instruct
Official repo for SAO-Instruct: Free-form Audio Editing using Natural Language Instructions, presented at NeurIPS 2025
☆18 · Oct 28, 2025 · Updated 9 months ago
chenyuxin1999 / Abstract_Thought
[NeurIPS 2025] Implementation of the paper "The Emergence of Abstract Thought in Large Language Models Beyond Any Language"
☆19 · Jun 9, 2025 · Updated last year
kaist-ami / AVHBench
[ICLR'25] Official repository for "AVHBench: A Cross-Modal Hallucination Evaluation for Audio-Visual Large Language Models"
☆25 · Mar 8, 2026 · Updated 4 months ago
ddlBoJack / Omni-Captioner
[ICLR 2026] Data Pipeline, Models, and Benchmark for Omni-Captioner.
☆142 · Apr 7, 2026 · Updated 3 months ago
jaeyeonkim99 / visage
Official implementation of "ViSAGe: Video-to-Spatial Audio Generation" (ICLR 2025)
☆47 · Sep 10, 2025 · Updated 10 months ago
JishengBai / AudioSetCaps
A 6-million Audio-Caption Paired Dataset Built with an LLM- and ALM-based Automatic Pipeline
☆208 · Dec 13, 2024 · Updated last year
HilaManor / AudioEditingCode
☆195 · Nov 19, 2025 · Updated 8 months ago
ddlBoJack / MMAR
[NeurIPS 2025] Benchmark data and code for MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix
☆214 · Feb 25, 2026 · Updated 5 months ago
xiquan-li / MeanAudio
[ACL 2026 Main] MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows
☆145 · Sep 2, 2025 · Updated 10 months ago
NKU-HLT / AudioEditor
☆47 · Apr 2, 2025 · Updated last year
bigai-nlco / UltraVoice
Official Repository of UltraVoice
☆63 · Oct 28, 2025 · Updated 9 months ago
ETH-DISCO / blap
Official repo for BLAP: Bootstrapping Language-Audio Pre-training for Music Captioning, presented at ICASSP 2025
☆16 · Nov 18, 2024 · Updated last year
zszheng147 / Spatial-AST
🦇 Encoder of BAT (Learning to Reason about Spatial Sounds with Large Language Models)
☆87 · Feb 13, 2025 · Updated last year
facebookresearch / FlowDec
A neural full-band audio codec for general audio sampled at 48 kHz with 7.5 kbps or 4.5 kbps.
☆213 · Updated this week
seungheondoh / lp-music-caps
LP-MusicCaps: LLM-Based Pseudo Music Captioning [ISMIR23]
☆348 · Apr 8, 2024 · Updated 2 years ago
ZeyueT / VidMuse
[CVPR 2025] Repository of VidMuse
☆140 · Jun 7, 2025 · Updated last year
wzk1015 / Awesome-Vision-to-Music-Generation
[ISMIR 2025] A curated list of vision-to-music generation: methods, datasets, evaluation and challenges.
☆126 · Aug 9, 2025 · Updated 11 months ago
xiaomi-research / tts-prism
☆48 · Apr 27, 2026 · Updated 3 months ago
haoheliu / audioldm_eval
This toolbox aims to unify audio generation model evaluation for easier comparison.
☆390 · Sep 29, 2024 · Updated last year
xiquan-li / Resonate
[INTERSPEECH 2026] Pre-training, SFT, DPO and GRPO for Text-to-Audio Generation
☆48 · Apr 17, 2026 · Updated 3 months ago
xiaomi-research / dasheng-audiogen
End-to-end text-to-audio scene generation model
☆50 · Jun 16, 2026 · Updated last month
01Zhangbw / Speech-and-audio-papers-Top-Conference
☆141 · Jan 24, 2026 · Updated 6 months ago
snap-research / GenAU
☆53 · Mar 24, 2026 · Updated 4 months ago
SAGNIKMJR / ego-AV-spatial-correspondence
[CVPR 2024] Code and datasets for 'Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos'
☆14 · Jun 16, 2024 · Updated 2 years ago