zhangfaen / finetune-Qwen2-VLLinks

☆379

Alternatives and similar repositories for finetune-Qwen2-VL

Users that are interested in finetune-Qwen2-VL are comparing it to the libraries listed below

Sorting:

zjysteven / lmms-finetune
A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision,…
☆353Updated 3 weeks ago
TIGER-AI-Lab / VLM2Vec
This repo contains the code for "VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks" [ICLR 2025]
☆473Updated last week
RLHF-V / RLAIF-V
[CVPR'25 highlight] RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness
☆423Updated 6 months ago
TinyLLaVA / TinyLLaVA_Factory
A Framework of Small-scale Large Multimodal Models
☆916Updated 6 months ago
thunlp / LLaVA-UHD
LLaVA-UHD v2: an MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer
☆390Updated last week
Fancy-MLLM / R1-Onevision
R1-onevision, a visual language model capable of deep CoT reasoning.
☆570Updated 7 months ago
VectorSpaceLab / MegaPairs
[ACL 2025 Oral] 🔥🔥 MegaPairs: Massive Data Synthesis for Universal Multimodal Retrieval
☆234Updated 2 weeks ago
turningpoint-ai / VisualThinker-R1-Zero
Explore the Multimodal “Aha Moment” on 2B Model
☆615Updated 8 months ago
microsoft / LLM2CLIP
LLM2CLIP makes SOTA pretrained CLIP model more SOTA ever.
☆563Updated 4 months ago
Osilly / Vision-R1
This is the first paper to explore how to effectively use R1-like RL for MLLMs and introduce Vision-R1, a reasoning MLLM that leverages …
☆727Updated 2 months ago
2U1 / Qwen-VL-Series-Finetune
An open-source implementaion for fine-tuning Qwen-VL series by Alibaba Cloud.
☆1,391Updated 3 weeks ago
OpenGVLab / OmniCorpus
[ICLR 2025 Spotlight] OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
☆402Updated 6 months ago
Victorwz / Open-Qwen2VL
[COLM 2025] Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources
☆283Updated 2 months ago
EvolvingLMMs-Lab / open-r1-multimodal
A fork to add multimodal model training to open-r1
☆1,421Updated 9 months ago
Visual-Agent / DeepEyes
☆971Updated last month
BAAI-DCAI / Bunny
A family of lightweight multimodal models.
☆1,046Updated last year
sandy1990418 / Finetune-Qwen2.5-VL
Fine-tuning Qwen2.5-VL for vision-language tasks | Optimized for Vision understanding | LoRA & PEFT support.
☆141Updated 9 months ago
RLHF-V / RLHF-V
[CVPR'24] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
☆296Updated last year
TideDra / lmm-r1
Extend OpenRLHF to support LMM RL training for reproduction of DeepSeek-R1 on multimodal tasks.
☆829Updated 6 months ago
ModalMinds / MM-EUREKA
MM-EUREKA: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning
☆760Updated 2 months ago
SkyworkAI / Vitron
NeurIPS 2024 Paper: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
☆576Updated last year
pkunlp-icler / FastV
[ECCV 2024 Oral] Code for paper: An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Langua…
☆515Updated 10 months ago
zai-org / CogCoM
☆215Updated last year
Yuliang-Liu / MultimodalOCR
On the Hidden Mystery of OCR in Large Multimodal Models (OCRBench)
☆749Updated 4 months ago
MoonshotAI / Kimi-VL
Kimi-VL: Mixture-of-Experts Vision-Language Model for Multimodal Reasoning, Long-Context Understanding, and Strong Agent Capabilities
☆1,102Updated 4 months ago
2U1 / Llama3.2-Vision-Finetune
An open-source implementaion for fine-tuning Llama3.2-Vision series by Meta.
☆172Updated 3 weeks ago
FreedomIntelligence / ALLaVA
Harnessing 1.4M GPT4V-synthesized Data for A Lite Vision-Language Model
☆276Updated last year
NExT-ChatV / NExT-Chat
The code of the paper "NExT-Chat: An LMM for Chat, Detection and Segmentation".
☆255Updated last year
FanqingM / MM-Eureka-V0
MM-Eureka V0 also called R1-Multimodal-Journey, Latest version is in MM-Eureka
☆320Updated 4 months ago
AIDC-AI / Ovis
A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.
☆1,406Updated last month