CASIA-IVA-Lab/VAST

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/CASIA-IVA-Lab/VAST)

CASIA-IVA-Lab / VAST

[NIPS2023] Code and Model for VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

☆301

Alternatives and similar repositories for VAST

Users that are interested in VAST are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

CASIA-IVA-Lab / VALOR
View on GitHub
[TPAMI2024] Codes and Models for VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
☆310Dec 25, 2024Updated last year
CASIA-IVA-Lab / OPT_Questioner
View on GitHub
Official PyTorch implementation of the paper "Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner"
☆15Aug 9, 2023Updated 2 years ago
CASIA-IVA-Lab / COSA
View on GitHub
[ICLR2024] Codes and Models for COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
☆43Dec 25, 2024Updated last year
X-PLUG / mPLUG-2
View on GitHub
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video (ICML 2023)
☆227Jul 21, 2023Updated 2 years ago
OpenGVLab / InternVideo
View on GitHub
[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
☆2,264Mar 25, 2026Updated last month
GPUs on demand by Runpod - Special Offer Available • Ad
Run AI, ML, and HPC workloads on powerful cloud GPUs—without limits or wasted spend. Deploy GPUs in under a minute and pay by the second.
OpenGVLab / unmasked_teacher
View on GitHub
[ICCV2023 Oral] Unmasked Teacher: Towards Training-Efficient Video Foundation Models
☆348May 27, 2024Updated last year
DCDmllm / Momentor
View on GitHub
☆81Nov 24, 2024Updated last year
XinhaoMei / WavCaps
View on GitHub
This reporsitory contains metadata of WavCaps dataset and codes for downstream tasks.
☆261Jul 25, 2024Updated last year
PKU-YuanGroup / LanguageBind
View on GitHub
【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
☆878Mar 25, 2024Updated 2 years ago
whwu95 / Cap4Video
View on GitHub
【CVPR'2023 Highlight & TPAMI】Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?
☆257Nov 29, 2024Updated last year
microsoft / XPretrain
View on GitHub
Multi-modality pre-training
☆512Mar 27, 2026Updated last month
OpenGVLab / VideoChat-Flash
View on GitHub
[ICLR2026] VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
☆523Nov 18, 2025Updated 6 months ago
YuanGongND / cav-mae
View on GitHub
Code and Pretrained Models for ICLR 2023 Paper "Contrastive Audio-Visual Masked Autoencoder".
☆291Mar 20, 2024Updated 2 years ago
DAMO-NLP-SG / Video-LLaMA
View on GitHub
[EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
☆3,145Jun 4, 2024Updated last year
Deploy on Railway without the complexity - Free Credits Offer • Ad
Connect your repo and Railway handles the rest with instant previews. Quickly provision container image services, databases, and storage volumes.
ArrowLuo / CLIP4Clip
View on GitHub
An official implementation for "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"
☆1,027Apr 12, 2024Updated 2 years ago
mlvlab / vid-TLDR
View on GitHub
Official implementation of CVPR 2024 paper "vid-TLDR: Training Free Token merging for Light-weight Video Transformer".
☆55Oct 21, 2025Updated 7 months ago
RenShuhuai-Andy / TimeChat
View on GitHub
[CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
☆421May 8, 2025Updated last year
XinhaoMei / ACT
View on GitHub
Source code for the paper 'Audio Captioning Transformer'
☆56Jan 18, 2022Updated 4 years ago
callsys / TextVR
View on GitHub
[PR 2024] A large Cross-Modal Video Retrieval Dataset with Reading Comprehension
☆31Dec 28, 2023Updated 2 years ago
wenhaochai / MovieChat
View on GitHub
[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
☆698Jan 29, 2025Updated last year
GenjiB / ECLIPSE
View on GitHub
☆33Mar 10, 2023Updated 3 years ago
YapengTian / AVVP-ECCV20
View on GitHub
Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing, ECCV, 2020. (Spotlight)
☆90Jul 25, 2024Updated last year
jpthu17 / EMCL
View on GitHub
[NeurIPS 2022 Spotlight] Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations
☆148Apr 9, 2024Updated 2 years ago
Managed hosting for WordPress and PHP on Cloudways • Ad
Managed hosting for WordPress, Magento, Laravel, or PHP apps, on multiple cloud providers. Deploy in minutes on Cloudways by DigitalOcean.
sangho-vision / acav100m
View on GitHub
ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning. In ICCV, 2021.
☆64Nov 18, 2021Updated 4 years ago
CNVid / CNVid-3.5M
View on GitHub
This repository contains the dataset, codebase, and benchmarks for our paper: <CNVid-3.5M: Build, Filter, and Pre-train the Large-scale P…
☆25Nov 28, 2023Updated 2 years ago
DAMO-NLP-SG / VideoLLaMA2
View on GitHub
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
☆1,297Jan 23, 2025Updated last year
stoneMo / SLAVC
View on GitHub
Official Codebase of "A Closer Look at Weakly-Supervised Audio-Visual Source Localization" (NeurIPS 2022)
☆21Dec 6, 2022Updated 3 years ago
jaeyeonkim99 / EnCLAP
View on GitHub
Official Implementation of EnCLAP (ICASSP 2024)
☆95Jun 2, 2024Updated last year
rikeilong / Bay-CAT
View on GitHub
[ECCV’24] Official Implementation for CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenario…
☆59Sep 4, 2024Updated last year
antoyang / VidChapters
View on GitHub
[NeurIPS 2023 D&B] VidChapters-7M: Video Chapters at Scale
☆208Nov 13, 2023Updated 2 years ago
Ahnsun / merlin
View on GitHub
[ECCV2024] Official code implementation of Merlin: Empowering Multimodal LLMs with Foresight Minds
☆97Jul 4, 2024Updated last year
lntzm / MESM
View on GitHub
The official code of Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval (AAAI2024)
☆32Mar 29, 2024Updated 2 years ago
GPU virtual machines on DigitalOcean Gradient AI • Ad
Get to production fast with high-performance AMD and NVIDIA GPUs you can spin up in seconds. The definition of operational simplicity.
JishengBai / AudioSetCaps
View on GitHub
A 6-million Audio-Caption Paired Dataset Built with a LLMs and ALMs-based Automatic Pipeline
☆202Dec 13, 2024Updated last year
microsoft / CLAP
View on GitHub
Learning audio concepts from natural language supervision
☆662Sep 18, 2024Updated last year
QQ-MM / Video-CCAM
View on GitHub
A lightweight flexible Video-MLLM developed by TencentQQ Multimedia Research Team.
☆74Oct 14, 2024Updated last year
GenjiB / LAVISH
View on GitHub
Vision Transformers are Parameter-Efficient Audio-Visual Learners
☆106Aug 11, 2023Updated 2 years ago
mbzuai-oryx / Video-ChatGPT
View on GitHub
[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the cap…
☆1,498Aug 5, 2025Updated 9 months ago
antoyang / FrozenBiLM
View on GitHub
[NeurIPS 2022] Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
☆159Dec 9, 2024Updated last year
huangb23 / VTimeLLM
View on GitHub
[CVPR'2024 Highlight] Official PyTorch implementation of the paper "VTimeLLM: Empower LLM to Grasp Video Moments".
☆297Jun 13, 2024Updated last year