Neleac / SpaceTimeGPT
A vision-language model for video description generation
☆17 · Updated 3 weeks ago
Alternatives and similar repositories for SpaceTimeGPT:
Users interested in SpaceTimeGPT are comparing it to the repositories listed below.
- 🤗 [ICLR'25] Multimodal Video Understanding Framework (MVU) ☆27 · Updated 2 weeks ago
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment ☆40 · Updated last month
- (WACV 2025 - Oral) Vision-language conversation in 10 languages including English, Chinese, French, Spanish, Russian, Japanese, Arabic, H… ☆82 · Updated this week
- ☆64 · Updated last year
- Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model ☆41 · Updated last month
- EILeV: Eliciting In-Context Learning in Vision-Language Models for Videos Through Curated Data Distributional Properties ☆118 · Updated 3 months ago
- Data-Efficient Multimodal Fusion on a Single GPU ☆52 · Updated 9 months ago
- OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation, arXiv 2024 ☆49 · Updated 3 weeks ago
- [ICLR 2025] CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion ☆40 · Updated 3 weeks ago
- [NeurIPS 2024] VideoGUI: A Benchmark for GUI Automation from Instructional Videos ☆29 · Updated 2 months ago
- Multi-model video-to-text by combining embeddings from Flan-T5 + CLIP + Whisper + SceneGraph. The 'backbone LLM' is pre-trained from scra… ☆53 · Updated last year
- Code release for the paper "Egocentric Video Task Translation" (CVPR 2023 Highlight) ☆32 · Updated last year
- [ACL'24 Oral] Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback ☆59 · Updated 5 months ago
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model ☆41 · Updated 6 months ago
- ☆56 · Updated 9 months ago
- Language Repository for Long Video Understanding ☆31 · Updated 8 months ago
- ☆159 · Updated 4 months ago
- Official implementation and dataset for the NAACL 2024 paper "ComCLIP: Training-Free Compositional Image and Text Matching" ☆35 · Updated 6 months ago
- ☆72 · Updated 9 months ago
- PyTorch code for "ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning" ☆18 · Updated 3 months ago
- [ICLR 2024] Code and models for COSA: Concatenated Sample Pretrained Vision-Language Foundation Model ☆41 · Updated last month
- Implementation of MaMMUT, a simple vision-encoder text-decoder architecture for multimodal tasks from Google, in PyTorch ☆99 · Updated last year
- Video-LLaVA fine-tuned for CinePile evaluation ☆46 · Updated 6 months ago
- Official code for our CVPR 2023 paper "Test of Time: Instilling Video-Language Models with a Sense of Time" ☆45 · Updated 8 months ago
- LAVIS - A One-stop Library for Language-Vision Intelligence ☆47 · Updated 6 months ago
- LLaVA-MORE: Enhancing Visual Instruction Tuning with LLaMA 3.1 ☆100 · Updated last week
- ☆18 · Updated 10 months ago
- [ACL 2024 Findings] "TempCompass: Do Video LLMs Really Understand Videos?", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, … ☆99 · Updated last week
- TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation ☆29 · Updated 2 months ago