Multi-model video-to-text by combining embeddings from Flan-T5 + CLIP + Whisper + SceneGraph. The 'backbone LLM' is pre-trained from scratch on YouTube (YT-1B dataset).
☆54 · Apr 21, 2023 · Updated 2 years ago
Alternatives and similar repositories for video-pretrained-transformer
Users who are interested in video-pretrained-transformer compare it to the libraries listed below.
- Implementation of the model: "(MC-ViT)" from the paper: "Memory Consolidation Enables Long-Context Video Understanding" ☆27 · Jan 17, 2026 · Updated last month
- A guide to structured generation using constrained decoding ☆14 · Jun 9, 2024 · Updated last year
- [ECCV2022] A PyTorch implementation of the paper "Spatial and Visual Perspective-Taking via View Rotation and Relation Reasoning for Embo… ☆13 · Mar 20, 2023 · Updated 2 years ago
- [NeurIPS 2023 D&B] VidChapters-7M: Video Chapters at Scale ☆204 · Nov 13, 2023 · Updated 2 years ago
- Implementation of VisionLLaMA from the paper: "VisionLLaMA: A Unified LLaMA Interface for Vision Tasks" in PyTorch and Zeta ☆16 · Nov 11, 2024 · Updated last year
- Ultra Fast Multi-Modality Vector Database ☆18 · Feb 21, 2024 · Updated 2 years ago
- Audio-visual diarization pipeline used for creating VoxConverse dataset ☆21 · Jun 6, 2025 · Updated 9 months ago
- 32 times longer context window than vanilla Transformers and up to 4 times longer than memory efficient Transformers. ☆50 · Jun 16, 2023 · Updated 2 years ago
- Implementation of MambaFormer in Pytorch ++ Zeta from the paper: "Can Mamba Learn How to Learn? A Comparative Study on In-Context Learnin… ☆21 · Feb 9, 2026 · Updated last month
- [ECCV 2024🔥] Official implementation of the paper "ST-LLM: Large Language Models Are Effective Temporal Learners" ☆151 · Sep 10, 2024 · Updated last year
- Code for "Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation" ☆26 · Mar 9, 2024 · Updated 2 years ago
- Video shot transition detection ☆25 · Mar 9, 2023 · Updated 3 years ago
- A simpler Pytorch + Zeta Implementation of the paper: "SiMBA: Simplified Mamba-based Architecture for Vision and Multivariate Time series… ☆28 · Nov 11, 2024 · Updated last year
- Implementation of the paper: "BRAVE : Broadening the visual encoding of vision-language models" ☆26 · Feb 6, 2026 · Updated last month
- [CVPR 2024] Do you remember? Dense Video Captioning with Cross-Modal Memory Retrieval ☆64 · Jun 19, 2024 · Updated last year
- Supercharged BLIP-2 that can handle videos ☆124 · Dec 1, 2023 · Updated 2 years ago
- Implementation of "the first large-scale multimodal mixture of experts models." from the paper: "Multimodal Contrastive Learning with… ☆36 · Jan 31, 2026 · Updated last month
- Uncovering Selective State Space Model's Capabilities in Lifelong Sequential Recommendation ☆34 · May 8, 2024 · Updated last year
- Context Free Grammar (CFG) parser library and application written in Python. ☆27 · Nov 22, 2023 · Updated 2 years ago
- The official code of Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval (AAAI2024) ☆32 · Mar 29, 2024 · Updated last year
- A "framework" to work locally on your Custom Coded Action and execute it in the same context as HubSpot. ☆10 · Feb 22, 2024 · Updated 2 years ago
- [CVPR 2024] Official PyTorch implementation of the paper "One For All: Video Conversation is Feasible Without Video Instruction Tuning" ☆35 · Feb 2, 2024 · Updated 2 years ago
- The project is an official implementation of our paper "RSGNet: Relation based Skeleton Graph Network for Crowded Scenes Pose Estimation… ☆10 · Dec 9, 2020 · Updated 5 years ago
- Implementation of the model from "Faster sorting algorithms discovered using deep reinforcement learning" that discovered an all-new ult… ☆11 · Aug 29, 2023 · Updated 2 years ago
- PyTorch implementation for "Rethinking Low-quality Optical Flow in Unsupervised Surgical Instrument Segmentation" ☆10 · Apr 11, 2024 · Updated last year
- Repository dedicated to developing a robust and modular framework for Multi-Agent Reinforcement Learning (MARL) algorithms. ☆13 · Mar 3, 2024 · Updated 2 years ago
- [ECCV 2024] Learning Video Context as Interleaved Multimodal Sequences ☆43 · Mar 11, 2025 · Updated 11 months ago
- Code snippets to build Elementor Plugin widgets ☆13 · Oct 13, 2022 · Updated 3 years ago
- [ISBI 2024] Official PyTorch implementation of Towards Cross-Domain Single Blood Cell Image Classification via Large-Scale LoRA-based Seg… ☆11 · Aug 12, 2024 · Updated last year
- [CVPR 2021] FMO Deblurring Benchmark ☆13 · Jan 12, 2022 · Updated 4 years ago
- An open-source non-official community implementation of the model from the paper: Surgical Robot Transformer (SRT): Imitation Learning fo… ☆11 · Feb 9, 2026 · Updated last month
- 🔥 Complete guide to standard LLM & Agent interview questions | LLM & Agent Interview Preparation Guide ☆50 · Feb 28, 2026 · Updated last week
- Code repository supporting the paper "Auto-Generating Weak Labels for Real & Synthetic Data to Improve Label-Scarce Medical Image Segment… ☆11 · Apr 29, 2024 · Updated last year