Mozhgan91 / LEO

LEO: A powerful Hybrid Multimodal LLM

☆18

Alternatives and similar repositories for LEO

Users who are interested in LEO are comparing it to the repositories listed below

TencentARC / Video-Holmes
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
☆77 Updated 4 months ago
OpenGVLab / PVC
[CVPR 2025] PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
☆50 Updated 5 months ago
marinero4972 / Open-o3-Video
Official implementation of "Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence"
☆118 Updated last week
Haochen-Wang409 / TreeVGR
Official implementation of "Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology"
☆70 Updated 2 weeks ago
OpenGVLab / Mono-InternVL
[CVPR 2025] Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
☆92 Updated 4 months ago
yu-rp / VisualPerceptionToken
☆126 Updated 8 months ago
appletea233 / LLaVA-ST
[CVPR 2025] LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding
☆77 Updated 4 months ago
OpenGVLab / TPO
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
☆62 Updated 4 months ago
Haochen-Wang409 / ross
[ICLR'25] Reconstructive Visual Instruction Tuning
☆125 Updated 7 months ago
Hon-Wong / ByteVideoLLM
[ICCV 2025] Dynamic-VLM
☆26 Updated 11 months ago
SHI-Labs / Slow-Fast-Video-Multimodal-LLM
☆25 Updated 7 months ago
yunlong10 / CAT-V
[AAAI 26 Demo] Official repo for CAT-V - Caption Anything in Video: Object-centric Dense Video Captioning with Spatiotemporal Multimodal P…
☆59 Updated 3 weeks ago
lxtGH / DenseWorld-1M
Code and dataset link for "DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World"
☆115 Updated last month
hanghuacs / FineCaption
☆37 Updated 5 months ago
zhouyiks / CoLVA
☆40 Updated 4 months ago
Liuziyu77 / MIA-DPO
Official implementation of MIA-DPO
☆67 Updated 10 months ago
xuyang-liu16 / VidCom2
[EMNLP 2025 Main] Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models
☆38 Updated 2 weeks ago
eric-ai-lab / GRIT
Official code for NeurIPS 2025 paper "GRIT: Teaching MLLMs to Think with Images"
☆163 Updated last month
callsys / ControlCap
[ECCV 2024] ControlCap: Controllable Region-level Captioning
☆79 Updated last year
Cooperx521 / ScaleCap
Official repository of "ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing"
☆57 Updated 4 months ago
TencentARC / SEED-Bench-R1
☆94 Updated 5 months ago
multimodal-reasoning-lab / Bagel-Zebra-CoT
https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT
☆101 Updated 3 weeks ago
zhishuifeiqian / VCR-Bench
VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning
☆32 Updated 4 months ago
XMUDeepLIT / AVG-LLaVA
Code for "AVG-LLaVA: A Multimodal Large Model with Adaptive Visual Granularity"
☆33 Updated last year
penghao-wu / visual_jigsaw
☆58 Updated 2 weeks ago
zhang9302002 / ThinkingWithVideos
The official code of "Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning"
☆64 Updated last month
Tiezheng11 / Vision-Language-Vision
☆62 Updated 4 months ago
mbzuai-oryx / VideoGLaMM
[CVPR 2025 🔥] A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
☆90 Updated 7 months ago
markywg / transagent
[NeurIPS 2024] TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration
☆24 Updated last year
LaVi-Lab / Visual-Table
[EMNLP 2024] Official code for "Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models"
☆20 Updated last year