McGill-NLP / diffusion-itmLinks

Code and data setup for the paper "Are Diffusion Models Vision-and-language Reasoners?"

☆33

Alternatives and similar repositories for diffusion-itm

Users that are interested in diffusion-itm are comparing it to the libraries listed below

Sorting:

linzhiqiu / visual_gpt_score
VisualGPTScore for visio-linguistic reasoning
☆27Updated 2 years ago
lezhang7 / Enhance-FineGrained
[CVPR 2024] Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Fine-grained Understanding
☆53Updated 7 months ago
dhg-wei / TOPA
(NeurIPS 2024 Spotlight) TOPA: Extend Large Language Models for Video Understanding via Text-Only Pre-Alignment
☆31Updated last year
ytaek-oh / vl_compo
☆10Updated last year
showlab / datacentric.vlp
Compress conventional Vision-Language Pre-training data
☆52Updated 2 years ago
TAU-VAILab / hierarcaps
Code and data for the paper "Emergent Visual-Semantic Hierarchies in Image-Text Representations" (ECCV 2024)
☆32Updated last year
arijitray1993 / COLA
COLA: Evaluate how well your vision-language model can Compose Objects Localized with Attributes!
☆25Updated 11 months ago
ExplainableML / ImageSelect
Code for the paper "If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection"
☆27Updated 2 years ago
wjpoom / SPEC
[CVPR 2024] The official implementation of paper "synthesize, diagnose, and optimize: towards fine-grained vision-language understanding"
☆49Updated 5 months ago
UW-Madison-Lee-Lab / CoBSAT
Implementation and dataset for paper "Can MLLMs Perform Text-to-Image In-Context Learning?"
☆41Updated 5 months ago
facebookresearch / genecis
Code and Models for "GeneCIS A Benchmark for General Conditional Image Similarity"
☆61Updated 2 years ago
aimagelab / pacscore
[CVPR 2023 & IJCV 2025] Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation
☆64Updated 3 months ago
adobe-research / llava-score
☆11Updated last year
ZhangYuanhan-AI / visual_prompt_retrieval
[NeurIPS2023] Official implementation and model release of the paper "What Makes Good Examples for Visual In-Context Learning?"
☆178Updated last year
Zi-hao-Wei / Efficient-Vision-Language-Pre-training-by-Cluster-Masking
[CVPR 2024] Improving language-visual pretraining efficiency by perform cluster-based masking on images.
☆29Updated last year
amitakamath / vl_text_encoders_are_bottlenecks
Code and datasets for "Text encoders are performance bottlenecks in contrastive vision-language models". Coming soon!
☆11Updated 2 years ago
QUVA-Lab / PIN
Official code repo of PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs
☆26Updated 10 months ago
eslambakr / HRS_benchmark
☆61Updated 2 years ago
tripletclip / TripletCLIP
[NeurIPS 2024] Official PyTorch implementation of "Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives"
☆45Updated 11 months ago
RAIVNLab / CREPE
[CVPR23 Highlight] CREPE: Can Vision-Language Foundation Models Reason Compositionally?
☆35Updated 2 years ago
codezakh / LilT
[ICLR 23] Contrastive Aligned of Vision to Language Through Parameter-Efficient Transfer Learning
☆40Updated 2 years ago
ChocoWu / SeTok
Codes for ICLR 2025 Paper: Towards Semantic Equivalence of Tokenization in Multimodal LLM
☆75Updated 7 months ago
lisadunlap / ALIA
Augmenting with Language-guided Image Augmentation (ALIA)
☆81Updated 2 years ago
sterzhang / PVIT
Official Repository of Personalized Visual Instruct Tuning
☆32Updated 8 months ago
kaist-ami / BEAF
[ECCV’24] Official repository for "BEAF: Observing Before-AFter Changes to Evaluate Hallucination in Vision-language Models"
☆21Updated 7 months ago
TencentARC / GVT
Official code for "What Makes for Good Visual Tokenizers for Large Language Models?".
☆58Updated 2 years ago
showlab / CLVQA
[AAAI2023] Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task (Oral)
☆39Updated last year
yuxiaochen1103 / FDT
☆62Updated 2 years ago
dhg-wei / MCL
(ICML 2024) Improve Context Understanding in Multimodal Large Language Models via Multimodal Composition Learning
☆27Updated last year
wuw2019 / LoTLIP
[NeurIPS 2024] Official PyTorch implementation of LoTLIP: Improving Language-Image Pre-training for Long Text Understanding
☆46Updated 10 months ago