Neleac / SpaceTimeGPT
A vision-language model for video description generation
☆19 · Updated 3 months ago
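For context, a minimal captioning sketch in Python, assuming the model is published on the Hugging Face Hub as a `VisionEncoderDecoderModel` under the id `Neleac/SpaceTimeGPT`, paired with a VideoMAE-style image processor and a GPT-2 tokenizer; the checkpoint id, processor/tokenizer pairing, and 8-frame sampling below are assumptions, so verify them against the model card before use.

```python
# Minimal inference sketch. ASSUMPTIONS: the checkpoint id, the paired image
# processor/tokenizer, and 8-frame sampling are guesses -- consult the model
# card on the Hugging Face Hub for the authoritative recipe.
import numpy as np
import torch
from transformers import AutoImageProcessor, AutoTokenizer, VisionEncoderDecoderModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumed checkpoint and processor ids -- verify against the repo.
model = VisionEncoderDecoderModel.from_pretrained("Neleac/SpaceTimeGPT").to(device)
image_processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Stand-in for 8 decoded RGB frames (H, W, C); in practice, sample them
# evenly from a real video with e.g. PyAV or decord.
frames = [np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8) for _ in range(8)]
pixel_values = image_processor(frames, return_tensors="pt").pixel_values.to(device)

# Autoregressively decode a caption for the clip.
with torch.no_grad():
    generated_ids = model.generate(pixel_values, max_new_tokens=40)
caption = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(caption)
```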
Alternatives and similar repositories for SpaceTimeGPT:
Users interested in SpaceTimeGPT are comparing it to the repositories listed below.
- EILeV: Eliciting In-Context Learning in Vision-Language Models for Videos Through Curated Data Distributional Properties ☆123 · Updated 5 months ago
- [ICLR 2024] Code and models for COSA: Concatenated Sample Pretrained Vision-Language Foundation Model ☆43 · Updated 4 months ago
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model ☆42 · Updated 8 months ago
- LAVIS - A One-stop Library for Language-Vision Intelligence ☆47 · Updated 8 months ago
- (WACV 2025 - Oral) Vision-language conversation in 10 languages including English, Chinese, French, Spanish, Russian, Japanese, Arabic, H… ☆84 · Updated 2 months ago
- ☆173 · Updated 6 months ago
- OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation, arXiv 2024 ☆58 · Updated 2 months ago
- 🤖 [ICLR'25] Multimodal Video Understanding Framework (MVU) ☆36 · Updated 2 months ago
- [PR 2024] A large Cross-Modal Video Retrieval Dataset with Reading Comprehension ☆26 · Updated last year
- ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models (ICLR 2024, Official Implementation) ☆16 · Updated last year
- OpenVLThinker: An Early Exploration to Vision-Language Reasoning via Iterative Self-Improvement ☆73 · Updated 3 weeks ago
- ☆97 · Updated 11 months ago
- Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model ☆57 · Updated 3 months ago
- ☆88 · Updated last year
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment ☆50 · Updated 3 months ago
- Official repo for StableLLAVA ☆95 · Updated last year
- ☆64 · Updated last year
- ✨✨Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models ☆155 · Updated 4 months ago
- [ECCV 2024] Parrot Captions Teach CLIP to Spot Text ☆66 · Updated 7 months ago
- Official implementation and dataset for the NAACL 2024 paper "ComCLIP: Training-Free Compositional Image and Text Matching" ☆35 · Updated 8 months ago
- Code and data for the paper "SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data" ☆34 · Updated last year
- Implementation of MaMMUT, a simple vision-encoder text-decoder architecture for multimodal tasks from Google, in Pytorch ☆100 · Updated last year
- [ACL 2024 Findings] "TempCompass: Do Video LLMs Really Understand Videos?", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, … ☆111 · Updated 3 weeks ago
- Multi-model video-to-text by combining embeddings from Flan-T5 + CLIP + Whisper + SceneGraph. The 'backbone LLM' is pre-trained from scra… ☆53 · Updated 2 years ago
- [NeurIPS 2024 D&B] VideoGUI: A Benchmark for GUI Automation from Instructional Videos ☆35 · Updated 2 weeks ago
- Matryoshka Multimodal Models ☆99 · Updated 3 months ago
- Official Repository of VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges ☆66 · Updated 2 months ago
- ☆33 · Updated 7 months ago
- This repo contains evaluation code for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive". https://arxiv.or… ☆121 · Updated 9 months ago
- ☆57 · Updated last year