SHI-Labs / OLA-VLMLinks

OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation, arXiv 2024

☆61

Alternatives and similar repositories for OLA-VLM

Users that are interested in OLA-VLM are comparing it to the libraries listed below

Sorting:

yihedeng9 / OpenVLThinker
OpenVLThinker: An Early Exploration to Vision-Language Reasoning via Iterative Self-Improvement
☆108Updated last month
kyegomez / MC-ViT
Implementation of the model: "(MC-ViT)" from the paper: "Memory Consolidation Enables Long-Context Video Understanding"
☆23Updated last week
OpenGVLab / TPO
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
☆59Updated last month
agents-x-project / PyVision
Official implementation of "PyVision: Agentic Vision with Dynamic Tooling."
☆125Updated last month
sunsmarterjie / ChatterBox
[AAAI2025] ChatterBox: Multi-round Multimodal Referring and Grounding, Multimodal, Multi-round dialogues
☆57Updated 4 months ago
om-ai-lab / ZoomEye
[EMNLP-2025] ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
☆52Updated 2 weeks ago
callsys / GMPO
Geometric-Mean Policy Optimization
☆74Updated last month
SparksJoe / Prism
A Framework for Decoupling and Assessing the Capabilities of VLMs
☆43Updated last year
Hao840 / ADEM-VL
PyTorch code for "ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning"
☆20Updated 10 months ago
XiaoduoAILab / XmodelVLM
☆69Updated last year
iancovert / locality-alignment
☆52Updated 7 months ago
jihaonew / MM-Instruct
MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment
☆35Updated last year
FreedomIntelligence / LongLLaVA
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
☆211Updated 8 months ago
si0wang / ThinkLite-VL
☆95Updated 3 months ago
shulin16 / MMInA
[ACL2025 Findings] Benchmarking Multihop Multimodal Internet Agents
☆46Updated 6 months ago
microsoft / x-reasoner
X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains
☆47Updated 4 months ago
NVlabs / STL
Official Pytorch Implementation of Self-emerging Token Labeling
☆35Updated last year
yannqi / R-4B
The official repository of "R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Integration"
☆103Updated last week
yfzhang114 / SliME
✨✨Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
☆162Updated 8 months ago
FudanNLPLAB / MouSi
☆74Updated last year
ByungKwanLee / Phantom
[Technical Report] Official PyTorch implementation code for realizing the technical part of Phantom of Latent representing equipped with …
☆61Updated 11 months ago
EvolvingLMMs-Lab / MGPO
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning
☆47Updated last month
mbzuai-oryx / VideoMolmo
Official code of the paper "VideoMolmo: Spatio-Temporal Grounding meets Pointing"
☆49Updated 2 months ago
Hon-Wong / ByteVideoLLM
[ICCV 2025] Dynamic-VLM
☆25Updated 8 months ago
m1k2zoo / negbench
Evaluation and dataset construction code for the CVPR 2025 paper "Vision-Language Models Do Not Understand Negation"
☆30Updated 4 months ago
markywg / transagent
[NeurIPS 2024] TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration
☆24Updated 10 months ago
top-yun / SPARK
A benchmark dataset and simple code examples for measuring the perception and reasoning of multi-sensor Vision Language models.
☆19Updated 8 months ago
WeihuangLin / INF-LLaVA
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model
☆42Updated last year
kxfan2002 / SophiaVL-R1
SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward
☆76Updated last month
mfarre / Video-LLaVA-7B-hf-CinePile
Video-LlaVA fine-tune for CinePile evaluation
☆51Updated last year