TXH-mercury / COSA

[ICLR2024] Codes and Models for COSA: Concatenated Sample Pretrained Vision-Language Foundation Model

☆43

Alternatives and similar repositories for COSA

Users who are interested in COSA are comparing it to the repositories listed below.

yangbang18 / MultiCapCLIP
(ACL'2023) MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning
☆36 Updated last year
RERV / UniAdapter
[ICLR2024] The official implementation of paper "UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling", by …
☆77 Updated last year
Nicous20 / FunQA
FunQA benchmarks funny, creative, and magic videos for challenging tasks including timestamp localization, video description, reasoning, …
☆104 Updated last year
klauscc / VindLU
☆110 Updated 2 years ago
farewellthree / STAN
Official PyTorch implementation of the paper "Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring"
☆107 Updated last year
microsoft / LAVENDER
A Unified Framework for Video-Language Understanding
☆61 Updated 2 years ago
showlab / cosmo
☆73 Updated last year
j-min / HiREST
Hierarchical Video-Moment Retrieval and Step-Captioning (CVPR 2023)
☆107 Updated 10 months ago
tsujuifu / pytorch_empirical-mvm
A PyTorch implementation of EmpiricalMVM
☆41 Updated last year
artemisp / LAVIS-XInstructBLIP
LAVIS - A One-stop Library for Language-Vision Intelligence
☆48 Updated last year
mzhaoshuai / CenterCLIP
[SIGIR 2022] CenterCLIP: Token Clustering for Efficient Text-Video Retrieval.
☆133 Updated 3 years ago
zhjohnchan / SK-VG
[CVPR-2023] The official dataset of Advancing Visual Grounding with Scene Knowledge: Benchmark and Method.
☆32 Updated 2 years ago
DCDmllm / Momentor
☆80 Updated last year
Hritikbansal / videocon
☆58 Updated last year
LeeYN-43 / Clover
Official PyTorch implementation of Clover: Towards A Unified Video-Language Alignment and Fusion Model (CVPR2023)
☆40 Updated 2 years ago
Ahnsun / merlin
[ECCV2024] Official code implementation of Merlin: Empowering Multimodal LLMs with Foresight Minds
☆96 Updated last year
rxtan2 / Koala-video-llm
☆36 Updated last year
TencentARC / GVT
Official code for "What Makes for Good Visual Tokenizers for Large Language Models?".
☆58 Updated 2 years ago
Yui010206 / SeViLA
[NeurIPS 2023] Self-Chained Image-Language Model for Video Localization and Question Answering
☆190 Updated last year
facebookresearch / video-distant-supervision
This is an official pytorch implementation of Learning To Recognize Procedural Activities with Distant Supervision. In this repository, w…
☆43 Updated 2 years ago
xuguohai / X-CLIP
An official implementation for "X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval"
☆179 Updated last year
lbaermann / qaego4d
Code and Dataset for the CVPRW Paper "Where did I leave my keys? — Episodic-Memory-Based Question Answering on Egocentric Videos"
☆29 Updated 2 years ago
liveseongho / Awesome-Video-Language-Understanding
A Survey on video and language understanding.
☆50 Updated 2 years ago
ChenDelong1999 / polite-flamingo
🦩 Visual Instruction Tuning with Polite Flamingo - training multi-modal LLMs to be both clever and polite! (AAAI-24 Oral)
☆64 Updated last year
TencentARC / TVTS
Turning to Video for Transcript Sorting
☆48 Updated 2 years ago
Yaojie-Shen / CoCap
[ICCV 2023] Accurate and Fast Compressed Video Captioning
☆51 Updated 4 months ago
yuezih / Movie101
Narrative movie understanding benchmark
☆77 Updated 5 months ago
microsoft / UniTAB
UniTAB: Unifying Text and Box Outputs for Grounded VL Modeling, ECCV 2022 (Oral Presentation)
☆89 Updated 2 years ago
YoadTew / zero-shot-video-to-text
☆76 Updated 3 years ago
google-research-datasets / videoCC-data
VideoCC is a dataset containing (video-URL, caption) pairs for training video-text machine learning models. It is created using an automa…
☆78 Updated 3 years ago