zhaoyucs / VSD

Code for "Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation"

☆26

Alternatives and similar repositories for VSD

Users who are interested in VSD are comparing it to the repositories listed below

HenryHZY / VL-PET
[ICCV2023] Official code for "VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control"
☆52Updated 2 years ago
UniAdapter / UniAdapter
☆26Updated 2 years ago
UMass-Embodied-AGI / CoVLM
[ICLR 2023] CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding
☆45Updated 4 months ago
PVIT-official / PVIT
Repository of paper: Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models
☆37Updated 2 years ago
mightyzau / RegionBLIP
☆58Updated 2 years ago
z-x-yang / DoraemonGPT
Official repository of DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models
☆86Updated last year
zhousheng97 / EgoTextVQA
[CVPR'25] 🌟🌟 EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering
☆38Updated 3 months ago
ChocoWu / SeTok
Codes for ICLR 2025 Paper: Towards Semantic Equivalence of Tokenization in Multimodal LLM
☆75Updated 5 months ago
XLiu443 / Tem-adapter
[ICCV2023] Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer
☆37Updated 2 years ago
LuFan31 / CompreCap
CVPR2025: Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning
☆35Updated 6 months ago
RERV / UniAdapter
[ICLR2024] The official implementation of paper "UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling", by …
☆76Updated last year
PolyU-ChenLab / ETBench
👾 E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding (NeurIPS 2024)
☆65Updated 8 months ago
ztyang23 / BACON
☆18Updated last year
jingwangsg / MS-DETR
An official implementation for MS-DETR in ACL'23
☆17Updated 2 years ago
TencentARC / GVT
Official code for "What Makes for Good Visual Tokenizers for Large Language Models?".
☆58Updated 2 years ago
TXH-mercury / COSA
[ICLR2024] Codes and Models for COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
☆43Updated 9 months ago
franciszzj / OpenPSG
[ECCV 2024] OpenPSG: Open-set Panoptic Scene Graph Generation via Large Multimodal Models
☆49Updated 9 months ago
RUCAIBox / ComVint
The official GitHub page for ''What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Ins…
☆19Updated last year
chojw / genb
Generative Bias for Robust Visual Question Answering ( CVPR 2023 )
☆27Updated 2 years ago
microsoft / UniTAB
UniTAB: Unifying Text and Box Outputs for Grounded VL Modeling, ECCV 2022 (Oral Presentation)
☆88Updated 2 years ago
franciszzj / VLPrompt
[IJCV 2025] VLPrompt-PSG: Vision-Language Prompting for Panoptic Scene Graph Generation
☆27Updated last year
rentainhe / TRAR-VQA
[ICCV 2021] Official implementation of the paper "TRAR: Routing the Attention Spans in Transformers for Visual Question Answering"
☆67Updated 4 years ago
LaVi-Lab / Visual-Table
[EMNLP 2024] Official code for "Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models"
☆20Updated last year
yangjie-cv / WeThink
WeThink: Toward General-purpose Vision-Language Reasoning via Reinforcement Learning
☆35Updated 4 months ago
YYJMJC / LOUPE
☆45Updated 2 years ago
UMass-Embodied-AGI / VisualCoT
Codebase for AAAI 2024 conference paper Visual Chain-of-Thought Prompting for Knowledge-based Visual Reasoning
☆33Updated 7 months ago
liunian-harold-li / DesCo
☆30Updated last year
LouChao98 / VLGAE
Official Implementation for CVPR 2022 paper "Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language …
☆24Updated 2 years ago
LeeYN-43 / Clover
Official PyTorch implementation of Clover: Towards A Unified Video-Language Alignment and Fusion Model (CVPR2023)
☆40Updated 2 years ago
Share14 / ShareGemini
☆31Updated last year