stepfun-ai/Step-Audio

☆31

Alternatives and similar repositories for Step-Audio

Users that are interested in Step-Audio are comparing it to the libraries listed below.

stepfun-ai / Step-Video-T2V
☆3,182 · Mar 17, 2025 · Updated last year
FunAudioLLM / CosyVoice
Multi-lingual large voice generation model, providing inference, training and deployment full-stack ability.
☆22,188 · May 25, 2026 · Updated last month
MoonshotAI / Kimi-Audio
Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation
☆4,668 · Jun 21, 2025 · Updated last year
zai-org / GLM-4-Voice
GLM-4-Voice | End-to-end Chinese-English speech dialogue model
☆3,204 · Dec 5, 2024 · Updated last year
SparkAudio / Spark-TTS
Spark-TTS Inference Code
☆11,001 · Apr 9, 2025 · Updated last year
QwenLM / Qwen2-Audio
The official repo of Qwen2-Audio chat & pretrained large audio language model proposed by Alibaba Cloud.
☆2,087 · Apr 21, 2025 · Updated last year
SWivid / F5-TTS
Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
☆14,952 · Jul 5, 2026 · Updated last week
baichuan-inc / Baichuan-Audio
Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction
☆222 · Feb 28, 2025 · Updated last year
FireRedTeam / FireRedTTS
An Open-Sourced LLM-empowered Foundation TTS System
☆908 · Sep 28, 2025 · Updated 9 months ago
FunAudioLLM / SenseVoice
Open-source SenseVoiceSmall model for Mandarin, Cantonese, English, Japanese, and Korean ASR, language ID, emotion recognition, and audio…
☆8,864 · Updated this week
bytedance / MegaTTS3
☆6,085 · Jun 15, 2026 · Updated last month
modelscope / FunASR
Open-source speech recognition toolkit for training, inference, streaming ASR, VAD, punctuation, speaker diarization pipelines, and OpenA…
☆19,256 · Updated this week
fishaudio / fish-speech
SOTA Open Source TTS
☆31,273 · Jun 9, 2026 · Updated last month
open-mmlab / Amphion
Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junio…
☆9,931 · Mar 25, 2026 · Updated 3 months ago
QwenLM / Qwen2.5-Omni
Qwen2.5-Omni is an end-to-end multimodal model by Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, video, and pe…
☆4,037 · Jun 12, 2025 · Updated last year
modelscope / ClearerVoice-Studio
An AI-Powered Speech Processing Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Enhancement, Separation, and Target Spe…
☆4,304 · Aug 14, 2025 · Updated 11 months ago
SkyworkAI / SkyReels-V1
SkyReels V1: The first and most advanced open-source human-centric video foundation model
☆2,692 · Mar 10, 2025 · Updated last year
bytedance / LatentSync
Taming Stable Diffusion for Lip Sync!
☆5,875 · Jun 20, 2025 · Updated last year
stepfun-ai / Step-Audio2
Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation…
☆1,479 · Mar 16, 2026 · Updated 4 months ago
ASLP-lab / OSUM
OSUM & OSUM-EChat, open speech understanding model and empathetic spoken chatbot based on it, open-sourced by ASLP@NPU.
☆494 · Nov 23, 2025 · Updated 7 months ago
FunAudioLLM / FunMusic
A fundamental toolkit designed for music, song, and audio generation
☆1,367 · May 20, 2025 · Updated last year
zhenye234 / X-Codec-2.0
Codec for paper: LLaSA: Scaling Train-time and Inference-time Compute for LLaMA-based Speech Synthesis
☆361 · Jun 25, 2026 · Updated 3 weeks ago
kyutai-labs / moshi
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audi…
☆10,591 · May 16, 2026 · Updated 2 months ago
2noise / ChatTTS
A generative speech model for daily dialogue.
☆39,619 · Apr 10, 2026 · Updated 3 months ago
canopyai / Orpheus-TTS
Towards Human-Sounding Speech
☆6,237 · Dec 5, 2025 · Updated 7 months ago
xingchensong / S3Tokenizer
Reverse Engineering of Supervised Semantic Speech Tokenizer (S3Tokenizer) proposed in CosyVoice
☆517 · Dec 22, 2025 · Updated 6 months ago
FireRedTeam / FireRedASR
Open-source industrial-grade ASR models supporting Mandarin, Chinese dialects and English, achieving a new SOTA on public Mandarin ASR be…
☆1,935 · Feb 25, 2026 · Updated 4 months ago
Plachtaa / seed-vc
zero-shot voice conversion & singing voice conversion, with real-time support
☆3,870 · Apr 20, 2025 · Updated last year
gpt-omni / mini-omni
open-source multimodal large language model that can hear, talk while thinking. Featuring real-time end-to-end speech input and streaming…
☆3,562 · Nov 5, 2024 · Updated last year
BytedanceSpeech / seed-tts-eval
☆1,573 · Jun 14, 2024 · Updated 2 years ago
jishengpeng / WavChat
A Survey of Spoken Dialogue Models (60 pages)
☆317 · Nov 28, 2024 · Updated last year
maitrix-org / Voila
☆495 · May 6, 2025 · Updated last year
VITA-MLLM / VITA
✨✨[NeurIPS 2025] VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
☆2,519 · Mar 28, 2025 · Updated last year
shivammehta25 / Matcha-TTS
[ICASSP 2024] 🍵 Matcha-TTS: A fast TTS architecture with conditional flow matching
☆1,331 · Updated this week
gemelo-ai / vocos
Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis
☆1,144 · Aug 7, 2024 · Updated last year
X-LANCE / SLAM-LLM
A Framework for Speech, Language, Audio, Music Processing with Large Language Model
☆1,048 · Jan 15, 2026 · Updated 6 months ago
jixiaozhong / Sonic
Official implementation of "Sonic: Shifting Focus to Global Audio Perception in Portrait Animation"
☆3,266 · Jan 8, 2026 · Updated 6 months ago
ga642381 / speech-trident
Awesome speech/audio LLMs, representation learning, and codec models
☆1,239 · Updated this week
MatthewCYM / VoiceBench
[TACL'26] VoiceBench: Benchmarking LLM-Based Voice Assistants
☆379 · Jun 11, 2026 · Updated last month