zhaoyucs / VSD
Code for "Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation"
☆27 Updated last year
Alternatives and similar repositories for VSD
Users that are interested in VSD are comparing it to the libraries listed below
- [ICCV2023] Official code for "VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control" ☆53 Updated last year
- [CVPR 2025] EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering ☆32 Updated 2 weeks ago
- ☆23 Updated 2 years ago
- Repository of paper: Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models ☆37 Updated last year
- ☆58 Updated last year
- [CVPR-2023] The official dataset of Advancing Visual Grounding with Scene Knowledge: Benchmark and Method. ☆30 Updated last year
- [ICLR2024] Codes and Models for COSA: Concatenated Sample Pretrained Vision-Language Foundation Model ☆43 Updated 4 months ago
- The official GitHub page for "What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Ins… ☆19 Updated last year
- Official implementation for CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding ☆45 Updated last year
- Official repository of DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models ☆84 Updated 8 months ago
- ☆30 Updated 9 months ago
- ☆25 Updated last year
- E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding (NeurIPS 2024) ☆58 Updated 3 months ago
- [ICML 2024] Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning ☆49 Updated last year
- ☆12 Updated 9 months ago
- (ICCV 2023) Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation ☆47 Updated 10 months ago
- TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models ☆32 Updated 6 months ago
- [ECCV 2022] GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval ☆16 Updated 2 years ago
- FreeVA: Offline MLLM as Training-Free Video Assistant ☆61 Updated 11 months ago
- CVPR2022 - Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation ☆23 Updated 2 years ago
- The official implementation of JM3D. ☆30 Updated 3 weeks ago
- Official Implementation for CVPR 2022 paper "Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language … ☆24 Updated 2 years ago
- Turning to Video for Transcript Sorting ☆48 Updated last year
- ☆37 Updated 2 years ago
- ☆34 Updated 8 months ago
- ☆17 Updated 2 weeks ago
- Code and Dataset for the CVPRW Paper "Where did I leave my keys? – Episodic-Memory-Based Question Answering on Egocentric Videos" ☆25 Updated last year
- LAVIS - A One-stop Library for Language-Vision Intelligence ☆47 Updated 9 months ago
- 🦩 Visual Instruction Tuning with Polite Flamingo - training multi-modal LLMs to be both clever and polite! (AAAI-24 Oral) ☆64 Updated last year
- ☆35 Updated last year