Neleac / SpaceTimeGPTLinks
video description generation vision-language model
☆19Updated 7 months ago
Alternatives and similar repositories for SpaceTimeGPT
Users that are interested in SpaceTimeGPT are comparing it to the libraries listed below
Sorting:
- ☆182Updated 10 months ago
- (WACV 2025 - Oral) Vision-language conversation in 10 languages including English, Chinese, French, Spanish, Russian, Japanese, Arabic, H…☆83Updated 3 weeks ago
- EILeV: Eliciting In-Context Learning in Vision-Language Models for Videos Through Curated Data Distributional Properties☆130Updated 9 months ago
- [CVPR 2023] HierVL Learning Hierarchical Video-Language Embeddings☆46Updated 2 years ago
- Video-LlaVA fine-tune for CinePile evaluation☆51Updated last year
- Language Repository for Long Video Understanding☆32Updated last year
- ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models (ICLR 2024, Official Implementation)☆16Updated last year
- ☆65Updated last year
- Supercharged BLIP-2 that can handle videos☆120Updated last year
- ☆69Updated last year
- [NeurIPS 2023 D&B] VidChapters-7M: Video Chapters at Scale☆194Updated last year
- Multi-model video-to-text by combining embeddings from Flan-T5 + CLIP + Whisper + SceneGraph. The 'backbone LLM' is pre-trained from scra…☆52Updated 2 years ago
- ☆80Updated 11 months ago
- ☆58Updated last year
- Pytorch implementation of Twelve Labs' Video Foundation Model evaluation framework & open embeddings☆29Updated last year
- ☆84Updated 2 years ago
- Official implementation of our paper "Finetuned Multimodal Language Models are High-Quality Image-Text Data Filters".☆65Updated 4 months ago
- PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models☆257Updated 3 weeks ago
- ☆84Updated 11 months ago
- [NeurIPS2023] Official implementation of the paper "Large Language Models are Visual Reasoning Coordinators"☆104Updated last year
- ✨✨Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models☆162Updated 8 months ago
- [ICLR 2025] CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion☆49Updated 2 months ago
- [ICML 2025] This is the official repository of our paper "What If We Recaption Billions of Web Images with LLaMA-3 ?"☆138Updated last year
- [TMM 2023] VideoXum: Cross-modal Visual and Textural Summarization of Videos☆47Updated last year
- [EMNLP 2024] Official PyTorch implementation code for realizing the technical part of Traversal of Layers (TroL) presenting new propagati…☆97Updated last year
- [NeurIPS 2024] Official PyTorch implementation code for realizing the technical part of Mamba-based traversal of rationale (Meteor) to im…☆115Updated last year
- Graph learning framework for long-term video understanding☆66Updated last month
- Official implementation of "Describing Differences in Image Sets with Natural Language" (CVPR 2024 Oral)☆120Updated last year
- [ICLR 2025] Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision☆69Updated last year
- Matryoshka Multimodal Models☆113Updated 7 months ago