zhaoyucs / VSD
Code for "Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation"
☆27 · Updated last year
Alternatives and similar repositories for VSD:
Users who are interested in VSD are comparing it to the repositories listed below:
- [ICCV2023] Official code for "VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control" ☆53 · Updated last year
- Repository of paper: Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models ☆37 · Updated last year
- E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding (NeurIPS 2024) ☆57 · Updated 2 months ago
- Official implementation for CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding ☆45 · Updated last year
- Official Implementation for CVPR 2022 paper "Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language … ☆23 · Updated 2 years ago
- Official code for "What Makes for Good Visual Tokenizers for Large Language Models?". ☆58 · Updated last year
- [ICLR2024] Codes and Models for COSA: Concatenated Sample Pretrained Vision-Language Foundation Model ☆43 · Updated 3 months ago
- The official implementation of JM3D. ☆29 · Updated last year
- [ECCV 2022] GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval ☆16 · Updated 2 years ago
- Official repository of DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models ☆83 · Updated 6 months ago
- The official GitHub page for "What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Ins… ☆19 · Updated last year
- Official repo for "Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge" (ICLR 2025) ☆40 · Updated 2 weeks ago
- TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models ☆29 · Updated 4 months ago
- [CVPR 2025] OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? ☆40 · Updated last week
- [ECCV'22 Poster] Explicit Image Caption Editing ☆21 · Updated 2 years ago
- UniTAB: Unifying Text and Box Outputs for Grounded VL Modeling, ECCV 2022 (Oral Presentation) ☆85 · Updated last year
- VLPrompt: Vision-Language Prompting for Panoptic Scene Graph Generation ☆25 · Updated 6 months ago
- Code and Dataset for the CVPRW Paper "Where did I leave my keys? - Episodic-Memory-Based Question Answering on Egocentric Videos" ☆23 · Updated last year
- [CVPR 2025] EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering ☆24 · Updated 2 weeks ago
- [ICCV2023] Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer ☆36 · Updated last year
- Repository of our accepted CVPR2022 paper "Counterfactual Cycle-Consistent Learning for Instruction Following and Generation in Vision-La… ☆28 · Updated 3 years ago
- [CVPR'24 Highlight] The official code and data for paper "EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Lan… ☆58 · Updated last week
- ✨ A curated list of papers on uncertainty in multi-modal large language models (MLLMs). ☆39 · Updated last week
- Official Implementation of Learning Navigational Visual Representations with Semantic Map Supervision (ICCV2023) ☆25 · Updated last year