zhaoyucs / VSD
Code for "Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation"
☆26 Updated 10 months ago
Alternatives and similar repositories for VSD:
Users who are interested in VSD are comparing it to the libraries listed below:
- [ICCV2023] Official code for "VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control" ☆53 Updated last year
- The official implementation of JM3D. ☆28 Updated last year
- Repository of paper: Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models ☆37 Updated last year
- Official implementation for CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding ☆43 Updated last year
- ✨A curated list of papers on the uncertainty in multi-modal large language model (MLLM). ☆27 Updated this week
- ☆26 Updated 5 months ago
- [ECCV 2022] GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval ☆16 Updated 2 years ago
- VLPrompt: Vision-Language Prompting for Panoptic Scene Graph Generation ☆22 Updated 3 months ago
- Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision ☆27 Updated 2 months ago
- 👾 E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding (NeurIPS 2024) ☆50 Updated this week
- CVPR2022 - Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation ☆23 Updated 2 years ago
- ☆22 Updated last year
- ☆58 Updated last year
- [NeurIPS 2024] Official PyTorch implementation of "Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives" ☆32 Updated last month
- FreeVA: Offline MLLM as Training-Free Video Assistant ☆54 Updated 7 months ago
- [ECCV 2024 Best Paper Candidate] Implementation of "Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Vi… ☆48 Updated 3 months ago
- [ICLR2024] Codes and Models for COSA: Concatenated Sample Pretrained Vision-Language Foundation Model ☆41 Updated 3 weeks ago
- Official code for "What Makes for Good Visual Tokenizers for Large Language Models?". ☆56 Updated last year
- Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning ☆19 Updated 4 months ago
- Official repository of DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models ☆78 Updated 4 months ago
- [EMNLP'22] Weakly-Supervised Temporal Article Grounding ☆14 Updated last year
- [CVPR'24 Highlight] The official code and data for paper "EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Lan… ☆55 Updated last month
- ☆13 Updated 3 years ago
- (NeurIPS 2024) What Makes CLIP More Robust to Long-Tailed Pre-Training Data? A Controlled Study for Transferable Insights ☆22 Updated 2 months ago
- This is the official repo for ByteVideoLLM/Dynamic-VLM ☆18 Updated last month
- (ICLR 2024, CVPR 2024) SparseFormer ☆67 Updated 2 months ago
- Codes for Paper: Towards Semantic Equivalence of Tokenization in Multimodal LLM ☆48 Updated 3 months ago
- The released data for paper "Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models". ☆32 Updated last year
- Repository for the paper: Teaching VLMs to Localize Specific Objects from In-context Examples ☆19 Updated last month
- Official Implementation for CVPR 2022 paper "Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language … ☆23 Updated 2 years ago