Neleac / SpaceTimeGPTLinks
video description generation vision-language model
☆19Updated 5 months ago
Alternatives and similar repositories for SpaceTimeGPT
Users that are interested in SpaceTimeGPT are comparing it to the libraries listed below
Sorting:
- PyTorch implementation of "Sample- and Parameter-Efficient Auto-Regressive Image Models" from CVPR 2025☆12Updated 4 months ago
- [ICLR 2025] CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion☆48Updated 2 weeks ago
- ☆98Updated last year
- [ICLR 2023] CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding☆45Updated last month
- ACL'24 (Oral) Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback☆67Updated 10 months ago
- ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models (ICLR 2024, Official Implementation)☆16Updated last year
- Language Repository for Long Video Understanding☆31Updated last year
- ☆179Updated 9 months ago
- Code for our ICLR 2024 paper "PerceptionCLIP: Visual Classification by Inferring and Conditioning on Contexts"☆77Updated last year
- Action Scene Graphs for Long-Form Understanding of Egocentric Videos (CVPR 2024)☆41Updated 3 months ago
- ☆35Updated 10 months ago
- [ICLR2024] Codes and Models for COSA: Concatenated Sample Pretrained Vision-Language Foundation Model☆43Updated 6 months ago
- ☆58Updated last year
- LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning☆140Updated 2 months ago
- [NeurIPS 2024] Official PyTorch implementation code for realizing the technical part of Mamba-based traversal of rationale (Meteor) to im…☆115Updated last year
- High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning☆29Updated last week
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment☆51Updated 6 months ago
- ☆45Updated last month
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model☆42Updated 11 months ago
- Official code for Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation☆32Updated 2 months ago
- Supercharged BLIP-2 that can handle videos☆118Updated last year
- Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model☆67Updated 6 months ago
- [NeurIPS2023] Official implementation of the paper "Large Language Models are Visual Reasoning Coordinators"☆105Updated last year
- LAVIS - A One-stop Library for Language-Vision Intelligence☆48Updated 11 months ago
- [EMNLP 2023] TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding☆51Updated last year
- A benchmark dataset and simple code examples for measuring the perception and reasoning of multi-sensor Vision Language models.☆18Updated 6 months ago
- Video-LlaVA fine-tune for CinePile evaluation☆51Updated 11 months ago
- [NeurIPS 2024] Official implementation of the paper "Interfacing Foundation Models' Embeddings"☆125Updated 10 months ago
- Repo for paper "T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs"☆49Updated 4 months ago
- Pytorch implementation of Twelve Labs' Video Foundation Model evaluation framework & open embeddings☆27Updated 10 months ago