JaesungHuh/ca-subtitle

Implementation of "Look, Listen and Recognise: Character-Aware Audio-Visual Subtitling"

☆21

Alternatives and similar repositories for ca-subtitle

Users interested in ca-subtitle are comparing it to the repositories listed below.

zaocan666 / DyViSE
Dynamic vision-guided speaker embedding for audio-visual speaker diarization
☆12 · Jul 5, 2022 · Updated 4 years ago
stepfun-ai / StepAudio-Skills
Audio skills for Claw
☆27 · Apr 16, 2026 · Updated 3 months ago
yfyeung / DS-WED
[ICASSP 2026] Official code for "Measuring Prosody Diversity in Zero-Shot TTS: A New Metric, Benchmark, and Exploration"
☆17 · Apr 16, 2026 · Updated 3 months ago
Blinorot / utmos-pytorch
Unofficial fairseq-free PyTorch implementation of UTMOS (v1, 2022), matching the original system.
☆35 · Jun 6, 2026 · Updated last month
juice500ml / xlm_to_xlsr
Official implementation of the paper "Distilling a Pretrained Language Model to a Multilingual ASR Model" (Interspeech 2022)
☆12 · Mar 12, 2024 · Updated 2 years ago
MaikeZuefle / f-actor
☆28 · Jul 17, 2026 · Updated last week
zengchang233 / CrossSinger
The source code for the paper CrossSinger (ASRU 2023)
☆18 · Oct 12, 2023 · Updated 2 years ago
zju3dv / StreamingTalker
Code for "StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model", AAAI 2026 Oral
☆55 · Jun 15, 2026 · Updated last month
JusperLee / Look2hear
A toolkit for researchers in multimodal sound separation.
☆16 · Oct 20, 2023 · Updated 2 years ago
SmartSoundKAIST / 6DRIR-DL
6 DoF Directional Room Impulse Response (RIR) with Dense Loudspeaker Grid
☆17 · Aug 31, 2023 · Updated 2 years ago
Wangtk311 / SafeEar-Inference-Test-Script
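SafeEar is a deepfake audio detection model jointly developed by Zhejiang University and Tsinghua University. This is a model inference script that I wrote. I am not sure whether it is correct, as I am still a beginner; if there are any problems, please forgive me and point them out. Thank you!
☆16 · May 16, 2025 · Updated last year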
Alittleegg / Eureka-Audio
Eureka-Audio: A 1.7B lightweight audio–language model that matches 7B–30B models on ASR, audio understanding, and paralinguistic reasonin…
☆40 · Apr 11, 2026 · Updated 3 months ago
lucidrains / villa-X
Implementation of ViLLA-X, Enhancing Latent Action Modeling in Vision-Language-Action Models
☆23 · Aug 27, 2025 · Updated 10 months ago
yfyeung / CLSP
[ACL 2026 Main] Open-Ended Speaking Style Modeling via Fine-Grained and Multi-Granular Contrastive Language-Speech Pre-training
☆104 · Apr 6, 2026 · Updated 3 months ago
kyutai-labs / tts_longeval
☆30 · Apr 29, 2026 · Updated 2 months ago
pengzhendong / torchfa
Torch Audio Forced Aligner for Mixed Chinese (Mandarin or Cantonese) and English.
☆61 · Sep 5, 2025 · Updated 10 months ago
Soul-AILab / SAC
[ACL 2026 Main] Training, inference, and testing of the SAC speech codec model.
☆108 · Nov 1, 2025 · Updated 8 months ago
MRSAudio / MRSAudio_Main
MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations
☆43 · Oct 15, 2025 · Updated 9 months ago
JaesungHuh / av-diarization
Audio-visual diarization pipeline used for creating the VoxConverse dataset
☆22 · Jun 6, 2025 · Updated last year
meituan-longcat / LongCat-Audio-Codec
LongCat Audio Tokenizer and Detokenizer
☆301 · May 9, 2026 · Updated 2 months ago
dengcunqin / noise-reduction
Noise reduction
☆17 · Jul 3, 2024 · Updated 2 years ago
ArrayDPS / ArrayDPS
☆40 · May 12, 2025 · Updated last year
ryuclc / CosyVoice2-GRPO
A simple implementation for improving CosyVoice2 with the GRPO method
☆39 · May 5, 2026 · Updated 2 months ago
xiaoxiaomiao323 / MSA
☆16 · Feb 19, 2026 · Updated 5 months ago
etzinis / heterogeneous_separation
Code and data recipes for the paper: Heterogeneous Target Speech Separation
☆44 · Dec 6, 2022 · Updated 3 years ago
pengzhendong / audio-pipeline
☆23 · Oct 17, 2024 · Updated last year
inclusionAI / MingTok-Audio
☆88 · Feb 24, 2026 · Updated 5 months ago
OpenMOSS / MOSS-Audio-Tokenizer
MOSS-Audio-Tokenizer is a Causal Transformer-based audio tokenizer built on the CAT architecture. Trained on 3M hours of diverse audio, i…
☆248 · Jun 16, 2026 · Updated last month
cogmhear / avse_challenge
COG-MHEAR Audio-Visual Speech Enhancement Challenge
☆48 · Feb 17, 2026 · Updated 5 months ago
ozspeech / OZSpeech
[ACL 2025] OZSpeech: One-step Zero-shot Speech Synthesis with Learned-Prior-Conditioned Flow Matching
☆45 · Feb 9, 2025 · Updated last year
AmphionTeam / FlexiCodec
[ICLR 2026] FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates
☆50 · Jul 1, 2026 · Updated 3 weeks ago
LAION-AI / emotion-annotations
☆110 · Jul 15, 2026 · Updated last week
hyzhang24 / DuplexSLA
DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action
☆104 · May 20, 2026 · Updated 2 months ago
uark-cviu / Right2Talk
[ICCV'21] The Right to Talk: An Audio-Visual Transformer Approach
☆20 · Aug 2, 2021 · Updated 4 years ago
yluo42 / SRVQ
Spherical residual vector quantization (SRVQ)
☆31 · Aug 25, 2024 · Updated last year
ddlBoJack / MMAR
[NeurIPS 2025] Benchmark data and code for MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix
☆214 · Feb 25, 2026 · Updated 4 months ago
xzf-thu / Mini-Omni-Reasoner
Mini-Omni-Reasoner: a real-time speech reasoning framework that interleaves silent reasoning tokens with spoken response tokens (“thinkin…
☆166 · Aug 26, 2025 · Updated 10 months ago
wx9Songs / MOSS-Music-Data-Pipeline
☆44 · Apr 26, 2026 · Updated 2 months ago
wenet-e2e / wecut
Video cutting powered by AI
☆23 · Nov 15, 2022 · Updated 3 years ago