KastanDay / video-pretrained-transformer
Multi-model video-to-text by combining embeddings from Flan-T5 + CLIP + Whisper + SceneGraph. The 'backbone LLM' is pre-trained from scratch on YouTube (YT-1B dataset).
☆52 Updated last year
Alternatives and similar repositories for video-pretrained-transformer:
Users who are interested in video-pretrained-transformer are comparing it to the libraries listed below.
- [ICLR2024] Codes and Models for COSA: Concatenated Sample Pretrained Vision-Language Foundation Model ☆41 Updated last month
- ☆72 Updated 8 months ago
- Language Repository for Long Video Understanding ☆31 Updated 7 months ago
- [CVPR 2023] Official code for "Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations" ☆52 Updated last year
- Multimodal Video Understanding Framework (MVU) ☆27 Updated 8 months ago
- [ICCV2023] EgoObjects: A Large-Scale Egocentric Dataset for Fine-Grained Object Understanding ☆75 Updated last year
- [ACL 2024 Findings] "TempCompass: Do Video LLMs Really Understand Videos?", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, … ☆97 Updated 2 months ago
- Code for paper "VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos" ☆92 Updated 5 months ago
- ☆17 Updated 9 months ago
- [CVPR 2023] HierVL: Learning Hierarchical Video-Language Embeddings ☆45 Updated last year
- [ECCV2024] Official code implementation of Merlin: Empowering Multimodal LLMs with Foresight Minds ☆89 Updated 6 months ago
- ☆41 Updated last year
- VideoLLM: Modeling Video Sequence with Large Language Models ☆154 Updated last year
- Democratization of "PaLI: A Jointly-Scaled Multilingual Language-Image Model" ☆88 Updated 10 months ago
- LAVIS - A One-stop Library for Language-Vision Intelligence ☆47 Updated 5 months ago
- ☆156 Updated 3 months ago
- Supercharged BLIP-2 that can handle videos ☆117 Updated last year
- Code release for "EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone" [ICCV, 2023] ☆93 Updated 6 months ago
- ☆130 Updated 4 months ago
- [PR 2024] A large Cross-Modal Video Retrieval Dataset with Reading Comprehension ☆24 Updated last year
- ☆31 Updated last year
- Hierarchical Video-Moment Retrieval and Step-Captioning (CVPR 2023) ☆96 Updated last week
- [NeurIPS2024] VideoGUI: A Benchmark for GUI Automation from Instructional Videos ☆27 Updated last month
- Code release for the paper "Egocentric Video Task Translation" (CVPR 2023 Highlight) ☆32 Updated last year
- ☆92 Updated 8 months ago
- EILeV: Eliciting In-Context Learning in Vision-Language Models for Videos Through Curated Data Distributional Properties ☆118 Updated 2 months ago
- [ICLR 2025] CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion ☆39 Updated last week
- Implementation of the MC-ViT model from the paper "Memory Consolidation Enables Long-Context Video Understanding" ☆19 Updated this week
- [CVPR 2024 Champions] Solutions for EgoVis Challenges in CVPR 2024 ☆114 Updated 6 months ago
- A Survey on video and language understanding. ☆48 Updated last year