Multi-model video-to-text by combining embeddings from Flan-T5 + CLIP + Whisper + SceneGraph. The 'backbone LLM' is pre-trained from scratch on YouTube (YT-1B dataset).
☆54Apr 21, 2023Updated 2 years ago
Alternatives and similar repositories for video-pretrained-transformer
Users that are interested in video-pretrained-transformer are comparing it to the libraries listed below.
- [ECCV2022] A PyTorch implementation of the paper "Spatial and Visual Perspective-Taking via View Rotation and Relation Reasoning for Embo…☆13Mar 20, 2023Updated 3 years ago
- Implementation of the model: "(MC-ViT)" from the paper: "Memory Consolidation Enables Long-Context Video Understanding"☆27Mar 13, 2026Updated 2 weeks ago
- [NeurIPS 2023 D&B] VidChapters-7M: Video Chapters at Scale☆207Nov 13, 2023Updated 2 years ago
- Code for DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue☆14Oct 12, 2021Updated 4 years ago
- [EMNLP 2024] PsyGUARD: An Automated System for Suicide Detection and Risk Assessment in Psychological Counseling☆22Apr 21, 2025Updated 11 months ago
- A guide to structured generation using constrained decoding☆14Jun 9, 2024Updated last year
- Implementation of VisionLLaMA from the paper: "VisionLLaMA: A Unified LLaMA Interface for Vision Tasks" in PyTorch and Zeta☆16Nov 11, 2024Updated last year
- Implementation of DropCov as described in DropCov: A Simple yet Effective Method for Improving Deep Architectures☆10Oct 15, 2022Updated 3 years ago
- ☆58Dec 2, 2025Updated 3 months ago
- ☆54Apr 24, 2024Updated last year
- 32 times longer context window than vanilla Transformers and up to 4 times longer than memory efficient Transformers.☆50Jun 16, 2023Updated 2 years ago
- Official implementation for paper Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos☆28Dec 8, 2023Updated 2 years ago
- ☆18Aug 19, 2024Updated last year
- [CVPR 2024] Do you remember? Dense Video Captioning with Cross-Modal Memory Retrieval☆64Jun 19, 2024Updated last year
- MLIR backend for Nx☆14May 24, 2024Updated last year
- Audio-visual diarization pipeline used for creating VoxConverse dataset☆21Jun 6, 2025Updated 9 months ago
- Learning Interactions and Relationships between Movie Characters (CVPR'20)☆22Apr 12, 2023Updated 2 years ago
- Implementation of MambaFormer in Pytorch ++ Zeta from the paper: "Can Mamba Learn How to Learn? A Comparative Study on In-Context Learnin…☆21Updated this week
- Ultra Fast Multi-Modality Vector Database☆18Feb 21, 2024Updated 2 years ago
- ☆13Jul 20, 2022Updated 3 years ago
- ☆27May 4, 2020Updated 5 years ago
- Multimodal and multilingual topic model with pretrained embeddings☆12Apr 11, 2023Updated 2 years ago
- Leveraging DSPy for AI-driven task understanding and solution generation, the Self-Discover Framework automates problem-solving through r…☆75Nov 4, 2025Updated 4 months ago
- Simulates agent path planning using A* and Q-Learning in a 2D grid☆12Apr 5, 2014Updated 11 years ago
- [ECCV 2024] Learning Video Context as Interleaved Multimodal Sequences☆44Mar 11, 2025Updated last year
- Self-supervised place recognition by exploring temporal and feature neighborhoods☆16Dec 9, 2024Updated last year
- PyTorch implementation of a neural network capable of recognizing finger patterns on the frets of a guitar.☆10Sep 26, 2021Updated 4 years ago
- Task-Focused Few-Shot Object Detection Benchmark☆14Jun 24, 2025Updated 9 months ago
- Generate interleaved text and image content in a structured format you can directly pass to downstream APIs.☆29Oct 18, 2024Updated last year
- [3DV 2025] VXP: Voxel-Cross-Pixel Large-scale Image-LiDAR Place Recognition☆18Mar 18, 2025Updated last year
- VoxelNet on the ASTYX radar dataset for automotive object detection☆13Aug 28, 2020Updated 5 years ago
- End-to-End Dense Video Captioning with Parallel Decoding (ICCV 2021)☆230Jan 3, 2024Updated 2 years ago
- [CVPR 2023] HierVL: Learning Hierarchical Video-Language Embeddings☆46Aug 14, 2023Updated 2 years ago
- Official Implementation for "SiLVR : A Simple Language-based Video Reasoning Framework"☆19Jan 18, 2026Updated 2 months ago
- [ICCV2023 Oral] Implicit Temporal Modeling with Learnable Alignment for Video Recognition☆41Nov 29, 2023Updated 2 years ago
- SimOn: A Simple Framework for Online Temporal Action Localization☆22Nov 12, 2022Updated 3 years ago
- ☆15Oct 29, 2019Updated 6 years ago
- Research Notes☆11Sep 13, 2020Updated 5 years ago
- Video shot transition detection☆25Mar 9, 2023Updated 3 years ago