aimagelab / DiCO

[BMVC 2024 Oral ✨] Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization

☆18

Alternatives and similar repositories for DiCO

Users who are interested in DiCO are comparing it to the repositories listed below.

wjpoom / SPEC
[CVPR 2024] The official implementation of the paper "Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding"
☆44 Updated last month
tian1327 / SWAT
[CVPR 2025] Few-shot Recognition via Stage-Wise Retrieval-Augmented Finetuning
☆20 Updated 3 weeks ago
m1k2zoo / negbench
Evaluation and dataset construction code for the CVPR 2025 paper "Vision-Language Models Do Not Understand Negation"
☆27 Updated 2 months ago
iancovert / locality-alignment
☆51 Updated 6 months ago
eric-ai-lab / ComCLIP
Official implementation and dataset for the NAACL 2024 paper "ComCLIP: Training-Free Compositional Image and Text Matching"
☆35 Updated 11 months ago
uvavision / SyViC
[ICCV 2023] Going Beyond Nouns With Vision & Language Models Using Synthetic Data
☆12 Updated last year
mlvlab / RALF
Official implementation of CVPR 2024 paper "Retrieval-Augmented Open-Vocabulary Object Detection".
☆41 Updated 10 months ago
aimagelab / ReT
[CVPR 2025] Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval
☆18 Updated 3 months ago
locuslab / llava-token-compression
☆42 Updated 8 months ago
layer6ai-labs / fusemix
Data-Efficient Multimodal Fusion on a Single GPU
☆66 Updated last year
google / haloquest
☆20 Updated 11 months ago
tripletclip / TripletCLIP
[NeurIPS 2024] Official PyTorch implementation of "Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives"
☆41 Updated 7 months ago
sterzhang / PVIT
Official Repository of Personalized Visual Instruct Tuning
☆31 Updated 4 months ago
aimagelab / pacscore
[CVPR 2023] Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation
☆62 Updated 4 months ago
rui-qian / READ
Rui Qian, Xin Yin, Dejing Dou†: Reasoning to Attend: Try to Understand How <SEG> Token Works (CVPR 2025)
☆38 Updated 2 months ago
hananshafi / llmblueprint
[ICLR 2024] Official code for the paper "LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts"
☆80 Updated last year
V-STaR-Bench / V-STaR
Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning
☆24 Updated last week
linzhiqiu / CLIP-FlanT5
Training code for CLIP-FlanT5
☆26 Updated 11 months ago
opendatalab / CLIP-Parrot-Bias
[ECCV 2024] Parrot Captions Teach CLIP to Spot Text
☆66 Updated 10 months ago
Liuziyu77 / MIA-DPO
Official implementation of MIA-DPO
☆59 Updated 5 months ago
SHI-Labs / OLA-VLM
OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation, arXiv 2024
☆60 Updated 4 months ago
adobe-research / llava-score
☆11 Updated 9 months ago
lorebianchi98 / FG-CLIP
[CBMI 2024 Best Paper] Official repository of the paper "Is CLIP the main roadblock for fine-grained open-world perception?"
☆27 Updated 2 months ago
heliossun / SQ-LLaVA
Visual self-questioning for large vision-language assistant.
☆41 Updated 9 months ago
wuw2019 / LoTLIP
[NeurIPS 2024] Official PyTorch implementation of LoTLIP: Improving Language-Image Pre-training for Long Text Understanding
☆43 Updated 6 months ago
Vision-CAIR / Infinibench
Official InfiniBench: A Benchmark for Large Multi-Modal Models in Long-Form Movies and TV Shows
☆15 Updated last month
LgQu / TIGeR
Code for paper: Unified Text-to-Image Generation and Retrieval
☆15 Updated last year
alhojel / visual_task_vectors
☆38 Updated last year
QUVA-Lab / PIN
Official code repo of PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs
☆26 Updated 6 months ago
om-ai-lab / ZoomEye
ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
☆46 Updated 6 months ago