KastanDay / video-pretrained-transformerLinks

Multi-model video-to-text by combining embeddings from Flan-T5 + CLIP + Whisper + SceneGraph. The 'backbone LLM' is pre-trained from scratch on YouTube (YT-1B dataset).

☆52

Alternatives and similar repositories for video-pretrained-transformer

Users that are interested in video-pretrained-transformer are comparing it to the libraries listed below

Sorting:

cg1177 / VideoLLM
VideoLLM: Modeling Video Sequence with Large Language Models
☆158Updated 2 years ago
TXH-mercury / COSA
[ICLR2024] Codes and Models for COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
☆43Updated 9 months ago
TencentARC / ViT-Lens
[CVPR 2024] ViT-Lens: Towards Omni-modal Representations
☆182Updated 8 months ago
fabawi / ImageBind-LoRA
Fine-tuning "ImageBind One Embedding Space to Bind Them All" with LoRA
☆191Updated last year
mshukor / UnIVAL
[TMLR23] Official implementation of UnIVAL: Unified Model for Image, Video, Audio and Language Tasks.
☆229Updated last year
kyegomez / PALI
Democratization of "PaLI: A Jointly-Scaled Multilingual Language-Image Model"
☆91Updated last year
yukw777 / VideoBLIP
Supercharged BLIP-2 that can handle videos
☆122Updated last year
facebookresearch / VLaMP
Code for “Pretrained Language Models as Visual Planners for Human Assistance”
☆61Updated 2 years ago
mbzuai-oryx / Video-LLaVA
PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models
☆257Updated 2 months ago
kyegomez / MC-ViT
Implementation of the model: "(MC-ViT)" from the paper: "Memory Consolidation Enables Long-Context Video Understanding"
☆22Updated last week
facebookresearch / HierVL
[CVPR 2023] HierVL Learning Hierarchical Video-Language Embeddings
☆46Updated 2 years ago
facebookresearch / EgoObjects
[ICCV2023] EgoObjects: A Large-Scale Egocentric Dataset for Fine-Grained Object Understanding
☆77Updated 2 years ago
IntelLabs / GraVi-T
Graph learning framework for long-term video understanding
☆67Updated 3 months ago
LilyDaytoy / OpenPVSG
Benchmarking Panoptic Video Scene Graph Generation (PVSG), CVPR'23
☆97Updated last year
Ahnsun / merlin
[ECCV2024] Official code implementation of Merlin: Empowering Multimodal LLMs with Foresight Minds
☆94Updated last year
facebookresearch / EgoVLPv2
Code release for "EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone" [ICCV, 2023]
☆99Updated last year
antoyang / VidChapters
[NeurIPS 2023 D&B] VidChapters-7M: Video Chapters at Scale
☆198Updated last year
kyegomez / PALI3
Implementation of PALI3 from the paper PALI-3 VISION LANGUAGE MODELS: SMALLER, FASTER, STRONGER"
☆145Updated 2 weeks ago
NVlabs / LITA
☆185Updated last year
artemisp / LAVIS-XInstructBLIP
LAVIS - A One-stop Library for Language-Vision Intelligence
☆48Updated last year
showlab / assistgpt
☆66Updated 2 years ago
kkahatapitiya / LangRepo
Language Repository for Long Video Understanding
☆32Updated last year
google / video-localized-narratives
☆60Updated 2 years ago
Yui010206 / CREMA
[ICLR 2025] CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
☆53Updated 3 months ago
callsys / TextVR
[PR 2024] A large Cross-Modal Video Retrieval Dataset with Reading Comprehension
☆28Updated last year
Nicous20 / FunQA
FunQA benchmarks funny, creative, and magic videos for challenging tasks including timestamp localization, video description, reasoning, …
☆104Updated 10 months ago
cliangyu / Cola
[NeurIPS2023] Official implementation of the paper "Large Language Models are Visual Reasoning Coordinators"
☆103Updated last year
isekai-portal / Link-Context-Learning
☆99Updated last year
MCR-PEFT / Ex-MCR
☆44Updated 5 months ago
icoz69 / StableLLAVA
Official repo for StableLLAVA
☆94Updated last year