thunlp / Migician

[ACL2025 Findings] Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models

☆81

Alternatives and similar repositories for Migician

Users who are interested in Migician are comparing it to the repositories listed below.

cnzzx / VSA
Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines
☆128 Updated last year
ggg0919 / cantor
☆90 Updated last year
ZhangXJ199 / TinyLLaVA-Video
A Simple Framework of Small-scale LMMs for Video Understanding
☆103 Updated 5 months ago
eric-ai-lab / GRIT
Official code for NeurIPS 2025 paper "GRIT: Teaching MLLMs to Think with Images"
☆163 Updated last month
sterzhang / image-textualization
Image Textualization: An Automatic Framework for Generating Rich and Detailed Image Descriptions (NeurIPS 2024)
☆169 Updated last year
alibaba / conv-llava
☆123 Updated last year
WePOINTS / WePOINTS
☆186 Updated 9 months ago
xmu-xiaoma666 / Multimodal-Open-O1
Multimodal Open-O1 (MO1) is designed to enhance the accuracy of inference models by utilizing a novel prompt-based approach. This tool wo…
☆29 Updated last year
invictus717 / MiCo
[ICCV 2025] Explore the Limits of Omni-modal Pretraining at Scale
☆121 Updated last year
BytedanceDouyinContent / SAIL-VL2
The SAIL-VL2 series model developed by the BytedanceDouyinContent Group
☆76 Updated 2 months ago
bigai-nlco / VideoLLaMB
[ICCV 2025] Official Repository of VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges
☆78 Updated 9 months ago
AFeng-x / Draw-and-Understand
[ICLR2025] Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
☆91 Updated this week
farewellthree / PPLLaVA
Official GPU implementation of the paper "PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance"
☆130 Updated last year
TheEighthDay / SeekWorld
The first attempt to replicate o3-like visual clue-tracking reasoning capabilities.
☆61 Updated 4 months ago
IVGSZ / Flash-VStream
This is the official implementation of ICCV 2025 "Flash-VStream: Efficient Real-Time Understanding for Long Video Streams"
☆251 Updated last month
Hon-Wong / Elysium
[ECCV 2024] Elysium: Exploring Object-level Perception in Videos via MLLM
☆86 Updated last year
EvolvingLMMs-Lab / VideoMMMU
☆62 Updated 2 months ago
dongyh20 / Insight-V
[CVPR2025 Highlight] Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
☆229 Updated 3 weeks ago
rednote-hilab / dots.vlm1
The official repository of the dots.vlm1 instruct models proposed by rednote-hilab.
☆265 Updated 2 months ago
OpenGVLab / TPO
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
☆62 Updated 4 months ago
JiuTian-VL / JiuTian-LION
[CVPR 2024] LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge
☆153 Updated 3 months ago
PhoenixZ810 / MG-LLaVA
Official repository for the paper "MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning" (https://arxiv.org/abs/2406.17770).
☆158 Updated last year
MaverickRen / PixelLM
[CVPR 2024] PixelLM is an effective and efficient LMM for pixel-level reasoning and understanding.
☆243 Updated 9 months ago
yeliudev / VideoMind
💡 VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
☆284 Updated last month
HumanMLLM / ViSpeak
(ICCV2025) Official repository of paper "ViSpeak: Visual Instruction Feedback in Streaming Videos"
☆40 Updated 5 months ago
TempleX98 / MoVA
[NeurIPS 2024] MoVA: Adapting Mixture of Vision Experts to Multimodal Context
☆168 Updated last year
xjtupanda / Sparrow
Repo for paper "T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs"
☆48 Updated 3 months ago
yfzhang114 / SliME
✨✨Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
☆163 Updated 11 months ago
OpenGVLab / VideoChat-R1
[NIPS2025] VideoChat-R1 & R1.5: Enhancing Spatio-Temporal Perception and Reasoning via Reinforcement Fine-Tuning
☆231 Updated last month
Kwai-YuanQi / MM-RLHF
The Next Step Forward in Multimodal LLM Alignment
☆189 Updated 7 months ago