[ECCV’24] Official Implementation for CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios
☆58 · Updated Sep 4, 2024
Alternatives and similar repositories for Bay-CAT
Users interested in Bay-CAT are comparing it to the repositories listed below.
- ☆36 · Updated Jul 9, 2025
- ☆15 · Updated Jan 16, 2024
- PhysMamba: Efficient Remote Physiological Measurement with SlowFast Temporal Difference Mamba · ☆59 · Updated Nov 14, 2024
- This repo contains evaluation code for the paper "AV-Odyssey: Can Your Multimodal LLMs Really Understand Audio-Visual Information?" · ☆31 · Updated Dec 23, 2024
- ☆24 · Updated Feb 13, 2024
- This repository contains code for the AAAI 2025 paper "Dense Audio-Visual Event Localization under Cross-Modal Consistency and Multi-Temporal …" · ☆24 · Updated Aug 18, 2025
- ☆19 · Updated May 19, 2024
- Question-Aware Gaussian Experts for Audio-Visual Question Answering, official PyTorch implementation (CVPR'25, Highlight) · ☆29 · Updated Jun 6, 2025
- Official PyTorch code of GroundVQA (CVPR'24) · ☆64 · Updated Sep 13, 2024
- ☆16 · Updated May 1, 2025
- TTRV: Test-Time Reinforcement Learning for Vision–Language Models (CVPR 2026) · ☆39 · Updated Mar 8, 2026
- ☆44 · Updated May 20, 2025
- [ICML'25 Spotlight] Catch Your Emotion: Sharpening Emotion Perception in Multimodal Large Language Models · ☆52 · Updated Jan 21, 2026
- ☆12 · Updated Aug 25, 2023
- ☆29 · Updated Feb 27, 2025
- [NIPS2023] Code and Model for VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset · ☆300 · Updated Mar 14, 2024
- Official codebase for "Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling".