CASIA-IVA-Lab/VALOR

[TPAMI2024] Codes and Models for VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

☆311

Alternatives and similar repositories for VALOR

Users who are interested in VALOR are comparing it to the repositories listed below.

CASIA-IVA-Lab / VAST
[NIPS2023] Code and Model for VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
☆302 · Mar 14, 2024 · Updated 2 years ago
CASIA-IVA-Lab / COSA
[ICLR2024] Codes and Models for COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
☆43 · Dec 25, 2024 · Updated last year
CASIA-IVA-Lab / OPT_Questioner
Official PyTorch implementation of the paper "Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner"
☆15 · Aug 9, 2023 · Updated 2 years ago
CASIA-IVA-Lab / ChatBridge
ChatBridge, an approach to learning a unified multimodal model to interpret, correlate, and reason about various modalities without rely…
☆55 · Sep 4, 2023 · Updated 2 years ago
X-PLUG / mPLUG-2
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video (ICML 2023)
☆227 · Jul 21, 2023 · Updated 3 years ago
OpenGVLab / unmasked_teacher
[ICCV2023 Oral] Unmasked Teacher: Towards Training-Efficient Video Foundation Models
☆348 · May 27, 2024 · Updated 2 years ago
klauscc / VindLU
☆109 · Dec 23, 2022 · Updated 3 years ago
OpenGVLab / InternVideo
[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
☆2,339 · Jul 2, 2026 · Updated 2 weeks ago
DAMO-NLP-SG / Video-LLaMA
[EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
☆3,139 · Jun 4, 2024 · Updated 2 years ago
jpthu17 / EMCL
[NeurIPS 2022 Spotlight] Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations
☆148 · Apr 9, 2024 · Updated 2 years ago
microsoft / XPretrain
Multi-modality pre-training
☆511 · Mar 27, 2026 · Updated 3 months ago
GenjiB / ECLIPSE
☆33 · Mar 10, 2023 · Updated 3 years ago
m-bain / frozen-in-time
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval [ICCV'21]
☆377 · May 19, 2022 · Updated 4 years ago
ttgeng233 / UnAV
Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline (CVPR 2023)
☆73 · Jan 4, 2026 · Updated 6 months ago
mbzuai-oryx / Video-ChatGPT
[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the cap…
☆1,503 · Aug 5, 2025 · Updated 11 months ago
GenjiB / LAVISH
Vision Transformers are Parameter-Efficient Audio-Visual Learners
☆107 · Aug 11, 2023 · Updated 2 years ago
farewellthree / STAN
Official PyTorch implementation of the paper "Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring"
☆107 · Jan 28, 2024 · Updated 2 years ago
XinhaoMei / WavCaps
This repository contains metadata of WavCaps dataset and codes for downstream tasks.
☆264 · Jul 25, 2024 · Updated last year
CASIA-IVA-Lab / MRES
This repo holds the official code and data for "Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentati…
☆74 · Jun 3, 2024 · Updated 2 years ago
schowdhury671 / meerkat
☆35 · Jul 9, 2025 · Updated last year
kyegomez / Mirasol
PyTorch Implementation of the Model from "MIRASOL3B: A MULTIMODAL AUTOREGRESSIVE MODEL FOR TIME-ALIGNED AND CONTEXTUAL MODALITIES"
☆26 · Jan 27, 2025 · Updated last year
zjr2000 / Untrimmed-Video-Feature-Extractor
A simple and effective feature extractor for untrimmed videos
☆13 · Sep 1, 2022 · Updated 3 years ago
GeWu-Lab / MUSIC-AVQA
MUSIC-AVQA, CVPR2022 (ORAL)
☆100 · Dec 30, 2022 · Updated 3 years ago
Rubics-Xuan / IVG
This repo holds the official code and data for "Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with H…
☆15 · May 21, 2024 · Updated 2 years ago
Yui010206 / SeViLA
[NeurIPS 2023] Self-Chained Image-Language Model for Video Localization and Question Answering
☆198 · Jan 14, 2024 · Updated 2 years ago
researchmm / MM-Diffusion
[CVPR'23] MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
☆453 · Jun 5, 2024 · Updated 2 years ago
LAION-AI / audio-dataset
Audio Dataset for training CLAP and other models
☆747 · Jan 8, 2026 · Updated 6 months ago
showlab / all-in-one
[CVPR2023] All in One: Exploring Unified Video-Language Pre-training
☆281 · Mar 25, 2023 · Updated 3 years ago
antoyang / FrozenBiLM
[NeurIPS 2022] Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
☆159 · Dec 9, 2024 · Updated last year
JinhuaLiang / lam4fsl
An official repo for the paper "Adapting Language-Audio Models as Few-Shot Audio Learners"
☆31 · May 31, 2023 · Updated 3 years ago
PKU-YuanGroup / LanguageBind
【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
☆884 · Mar 25, 2024 · Updated 2 years ago
google-research-datasets / videoCC-data
VideoCC is a dataset containing (video-URL, caption) pairs for training video-text machine learning models. It is created using an automa…
☆78 · Dec 5, 2022 · Updated 3 years ago
zjr2000 / GVL
Official implementation for paper Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos
☆28 · Dec 8, 2023 · Updated 2 years ago
XYPB / CondFoleyGen
Official PyTorch implementation of "Conditional Generation of Audio from Video via Foley Analogies".
☆93 · Dec 8, 2023 · Updated 2 years ago
DCDmllm / Momentor
☆81 · Nov 24, 2024 · Updated last year
CASIA-IVA-Lab / SC-Tune
Official code for CVPR 2024 paper, "SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models"
☆16 · Apr 22, 2024 · Updated 2 years ago
ArrowLuo / CLIP4Clip
An official implementation for "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"
☆1,029 · Apr 12, 2024 · Updated 2 years ago
microsoft / UniVL
An official implementation for "UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation"
☆365 · Jul 25, 2024 · Updated last year
YapengTian / AVVP-ECCV20
Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing, ECCV, 2020. (Spotlight)
☆90 · Jul 25, 2024 · Updated last year