Neleac / SpaceTimeGPT
A vision-language model for video description generation
☆19 · Updated 3 months ago
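For context, a minimal captioning sketch in Python, assuming the model is published on the Hugging Face Hub as a `VisionEncoderDecoderModel` under the id `Neleac/SpaceTimeGPT`, paired with a VideoMAE-style image processor and a GPT-2 tokenizer; the checkpoint id, processor/tokenizer pairing, and 8-frame sampling below are assumptions, so verify them against the model card before use.

```python
# Minimal inference sketch. ASSUMPTIONS: the checkpoint id, the paired image
# processor/tokenizer, and 8-frame sampling are guesses -- consult the model
# card on the Hugging Face Hub for the authoritative recipe.
import numpy as np
import torch
from transformers import AutoImageProcessor, AutoTokenizer, VisionEncoderDecoderModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumed checkpoint and processor ids -- verify against the repo.
model = VisionEncoderDecoderModel.from_pretrained("Neleac/SpaceTimeGPT").to(device)
image_processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Stand-in for 8 decoded RGB frames (H, W, C); in practice, sample them
# evenly from a real video with e.g. PyAV or decord.
frames = [np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8) for _ in range(8)]
pixel_values = image_processor(frames, return_tensors="pt").pixel_values.to(device)

# Autoregressively decode a caption for the clip.
with torch.no_grad():
    generated_ids = model.generate(pixel_values, max_new_tokens=40)
caption = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(caption)
```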
Alternatives and similar repositories for SpaceTimeGPT:
Users interested in SpaceTimeGPT are comparing it to the repositories listed below.
- EILeV: Eliciting In-Context Learning in Vision-Language Models for Videos Through Curated Data Distributional Properties ☆123 · Updated 5 months ago
- [ICLR 2024] Code and models for COSA: Concatenated Sample Pretrained Vision-Language Foundation Model ☆43 · Updated 4 months ago
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model ☆42 · Updated 8 months ago
- LAVIS - A One-stop Library for Language-Vision Intelligence ☆47 · Updated 8 months ago
- (WACV 2025 - Oral) Vision-language conversation in 10 languages including English, Chinese, French, Spanish, Russian, Japanese, Arabic, H… ☆84 · Updated 2 months ago
- ☆173 · Updated 6 months ago
- OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation, arXiv 2024 ☆58 · Updated 2 months ago
- 🤖 [ICLR'25] Multimodal Video Understanding Framework (MVU) ☆36 · Updated 2 months ago
- [PR 2024] A large Cross-Modal Video Retrieval Dataset with Reading Comprehension ☆26 · Updated last year
- ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models (ICLR 2024, Official Implementation) ☆16 · Updated last year
- OpenVLThinker: An Early Exploration to Vision-Language Reasoning via Iterative Self-Improvement ☆73 · Updated 3 weeks ago
- ☆97 · Updated 11 months ago
- Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model ☆57 · Updated 3 months ago
- ☆88 · Updated last year
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment ☆50 · Updated 3 months ago
- Official repo for StableLLAVA ☆95 · Updated last year
- ☆64 · Updated last year
- ✨✨Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models ☆155 · Updated 4 months ago
- [ECCV 2024] Parrot Captions Teach CLIP to Spot Text ☆66 · Updated 7 months ago
- Official implementation and dataset for the NAACL 2024 paper "ComCLIP: Training-Free Compositional Image and Text Matching" ☆35 · Updated 8 months ago
- Code and data for the paper "SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data" ☆34 · Updated last year
- Implementation of MaMMUT, a simple vision-encoder text-decoder architecture for multimodal tasks from Google, in Pytorch ☆100 · Updated last year
- [ACL 2024 Findings] "TempCompass: Do Video LLMs Really Understand Videos?", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, … ☆111 · Updated 3 weeks ago
- Multi-model video-to-text by combining embeddings from Flan-T5 + CLIP + Whisper + SceneGraph. The 'backbone LLM' is pre-trained from scra… ☆53 · Updated 2 years ago
- [NeurIPS 2024 D&B] VideoGUI: A Benchmark for GUI Automation from Instructional Videos ☆35 · Updated 2 weeks ago
- Matryoshka Multimodal Models ☆99 · Updated 3 months ago
- Official Repository of VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges ☆66 · Updated 2 months ago
- ☆33 · Updated 7 months ago
- This repo contains evaluation code for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive". https://arxiv.or… ☆121 · Updated 9 months ago
- ☆57 · Updated last year