rikeilong / Bay-CAT

[ECCV’24] Official Implementation for CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios

☆57

Alternatives and similar repositories for Bay-CAT

Users who are interested in Bay-CAT are comparing it to the repositories listed below.


schowdhury671 / meerkat
☆34 Updated 4 months ago
GenjiB / LAVISH
Vision Transformers are Parameter-Efficient Audio-Visual Learners
☆106 Updated 2 years ago
ttgeng233 / UniAV
Unified Audio-Visual Perception for Multi-Task Video Localization
☆30 Updated last year
GeWu-Lab / TSPM
Official repository for "Boosting Audio Visual Question Answering via Key Semantic-Aware Cues" in ACM MM 2024.
☆17 Updated last year
ttgeng233 / LongVALE
LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos. (CVPR 2025)
☆52 Updated 5 months ago
JacobChalk / TIM
Codebase for the paper: "TIM: A Time Interval Machine for Audio-Visual Action Recognition"
☆46 Updated last year
ttgeng233 / UnAV
Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline (CVPR 2023)
☆69 Updated last year
haoyi-duan / DG-SCT
NeurIPS'2023 official implementation code
☆68 Updated 2 years ago
jasongief / OV-AVEL
[2025 CVPR] Towards Open-Vocabulary Audio-Visual Event Localization
☆36 Updated 8 months ago
yannqi / COMBO-AVS
[CVPR 2024 Highlight] Official implementation of the paper: Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-…
☆39 Updated 7 months ago
AIM-SKKU / QA-TIGER
Question-Aware Gaussian Experts for Audio-Visual Question Answering -- Official Pytorch Implementation (CVPR'25, Highlight)
☆24 Updated 5 months ago
jinxiang-liu / anno-free-AVS
Official code for WACV 2024 paper, "Annotation-free Audio-Visual Segmentation"
☆35 Updated last year
fyyCS / LSLD
☆14 Updated 2 years ago
EasonXiao-888 / UVCOM
[CVPR 2024] Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection
☆111 Updated last year
sangmin-git / MMSI
Code for "Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations" (CVPR 2024 Oral)
☆17 Updated last year
gyxxyg / TRACE
[ICLR 2025] TRACE: Temporal Grounding Video LLM via Causal Event Modeling
☆136 Updated 2 months ago
GeWu-Lab / Crab
[CVPR 2025] Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
☆75 Updated 3 weeks ago
yunlong10 / AVicuna
[AAAI 2025] Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding
☆33 Updated 7 months ago
TengdaHan / AutoAD
[CVPR'23 Highlight] AutoAD: Movie Description in Context.
☆100 Updated last year
GeWu-Lab / MUSIC-AVQA
MUSIC-AVQA, CVPR2022 (ORAL)
☆90 Updated 2 years ago
Exploring-Embodied-Emotion-official / E3
☆19 Updated 4 months ago
TXH-mercury / VAST
[NIPS2023] Code and Model for VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
☆292 Updated last year
Lzq5 / Video-Text-Alignment
☆25 Updated 4 months ago
ailab-kyunghee / CM2_DVC
[CVPR 2024] Do you remember? Dense Video Captioning with Cross-Modal Memory Retrieval
☆63 Updated last year
Lilidamowang / T2VIndexer-generativeSearch
☆12 Updated last year
jinhyunj / EaTR
Official pytorch repository for "Knowing Where to Focus: Event-aware Transformer for Video Grounding" (ICCV 2023)
☆53 Updated 2 years ago
j-min / HiREST
Hierarchical Video-Moment Retrieval and Step-Captioning (CVPR 2023)
☆107 Updated 9 months ago
MRHiSum / MR.HiSum
☆45 Updated last year
jpthu17 / DiffusionRet
[ICCV 2023] DiffusionRet: Generative Text-Video Retrieval with Diffusion Model
☆137 Updated last year
Yaojie-Shen / CoCap
[ICCV 2023] Accurate and Fast Compressed Video Captioning
☆50 Updated 3 months ago