MikeWangWZHL / VDLM

Repo for paper: https://arxiv.org/abs/2404.06479

☆28

Alternatives and similar repositories for VDLM

Users who are interested in VDLM are comparing it to the repositories listed below.

TIGER-AI-Lab / MEGA-Bench
This repo contains the code for "MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks" [ICLR 2025]
☆74 Updated 2 months ago
para-lost / AutoPresent
Code for the paper "AutoPresent: Designing Structured Visuals From Scratch" (CVPR 2025)
☆122 Updated 3 months ago
zeyofu / BLINK_Benchmark
This repo contains evaluation code for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive". https://arxiv.or…
☆137 Updated last year
kokolerk / TON
Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models
☆40 Updated 3 weeks ago
chenllliang / DnD-Transformer
[ICLR 2025] Source code for paper "A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegr…
☆77 Updated 8 months ago
pipilurj / G-LLaVA
Official GitHub repo of G-LLaVA
☆147 Updated 6 months ago
facebookresearch / multimodal_rewardbench
Multimodal RewardBench
☆46 Updated 6 months ago
declare-lab / LLM-PuzzleTest
This repository is maintained to release dataset and models for multimodal puzzle reasoning.
☆101 Updated 6 months ago
kaistAI / Volcano
[NAACL 2024] Vision language model that reduces hallucinations through self-feedback guided revision. Visualizes attentions on image feat…
☆46 Updated last year
VisualWebBench / VisualWebBench
Evaluation framework for paper "VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?"
☆59 Updated 10 months ago
WildVision-AI / LMM-Engines
☆16 Updated 10 months ago
Yushi-Hu / VisualSketchpad
Code for Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
☆254 Updated last month
zeyofu / ReFocus_Code
Code for ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding [ICML 2025]
☆37 Updated last month
JieyuZ2 / TaskMeAnything
[NeurIPS 2024] A task generation and model evaluation system for multimodal language models.
☆73 Updated 9 months ago
google / storybench
☆50 Updated last year
sail-sg / AnytimeReasoner
Optimizing Anytime Reasoning via Budget Relative Policy Optimization
☆45 Updated last month
Victorwz / MLM_Filter
Official implementation of our paper "Finetuned Multimodal Language Models are High-Quality Image-Text Data Filters".
☆65 Updated 4 months ago
wmn-231314 / diffusion-data-constraint
Official PyTorch implementation and models for paper "Diffusion Beats Autoregressive in Data-Constrained Settings". We find diffusion mod…
☆88 Updated last week
UW-Madison-Lee-Lab / CoBSAT
Implementation and dataset for paper "Can MLLMs Perform Text-to-Image In-Context Learning?"
☆41 Updated 3 months ago
TencentARC / GRPO-CARE
☆72 Updated 2 months ago
Lizw14 / Super-CLEVR
Code for paper "Super-CLEVR: A Virtual Benchmark to Diagnose Domain Robustness in Visual Reasoning"
☆45 Updated last year
orrzohar / Video-STaR
[ICLR 2025] Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision
☆69 Updated last year
beichenzbc / BoostStep
Official code for "BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning"
☆36 Updated 7 months ago
shulin16 / MMInA
[ACL 2025 Findings] Benchmarking Multihop Multimodal Internet Agents
☆46 Updated 6 months ago
EvolvingLMMs-Lab / multimodal-sae
[ICCV 2025] Auto Interpretation Pipeline and many other functionalities for Multimodal SAE Analysis.
☆150 Updated last month
YuxiXie / V-DPO
Preference Learning for LLaVA
☆49 Updated 9 months ago
yale-nlp / MMVU
Data and Code for CVPR 2025 paper "MMVU: Measuring Expert-Level Multi-Discipline Video Understanding"
☆72 Updated 6 months ago
chenllliang / G1
G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning
☆79 Updated 3 months ago
bigai-nlco / LatentSeek
Official Repository of LatentSeek
☆60 Updated 3 months ago
pipilurj / bootstrapped-preference-optimization-BPO
Code for "Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization"
☆59 Updated last year