2U1 / Molmo-Finetune

An open-source implementation for fine-tuning Molmo-7B-D and Molmo-7B-O by allenai.

☆58

Alternatives and similar repositories for Molmo-Finetune

Users who are interested in Molmo-Finetune are comparing it to the libraries listed below.

thunlp / LLaVA-UHD
LLaVA-UHD v2: an MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer
☆382Updated 3 months ago
MILVLG / imp
a family of highly capable yet efficient large multimodal models
☆186Updated 11 months ago
zai-org / CogCoM
☆204Updated last year
luogen1996 / LLaVA-HR
[ICLR2025] LLaVA-HR: High-Resolution Large Language-Vision Assistant
☆238Updated 11 months ago
kyegomez / PALI3
Implementation of PALI3 from the paper "PALI-3 VISION LANGUAGE MODELS: SMALLER, FASTER, STRONGER"
☆145Updated 2 weeks ago
IDEA-Research / ChatRex
Code for ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
☆198Updated 6 months ago
yfzhang114 / SliME
✨✨Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
☆160Updated 7 months ago
SHI-Labs / CuMo
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
☆152Updated last year
mbzuai-oryx / LlamaV-o1
[ACL 2025 🔥] Rethinking Step-by-step Visual Reasoning in LLMs
☆305Updated 2 months ago
jefferyZhan / Griffon
Official repo of Griffon series including v1(ECCV 2024), v2, and G
☆227Updated 2 months ago
WisconsinAIVision / ViP-LLaVA
[CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
☆327Updated last year
FreedomIntelligence / ALLaVA
Harnessing 1.4M GPT4V-synthesized Data for A Lite Vision-Language Model
☆267Updated last year
TempleX98 / MoVA
[NeurIPS 2024] MoVA: Adapting Mixture of Vision Experts to Multimodal Context
☆165Updated 10 months ago
TIGER-AI-Lab / Mantis
Official code for Paper "Mantis: Multi-Image Instruction Tuning" [TMLR 2024]
☆223Updated 4 months ago
Ucas-HaoranWei / Slow-Perception
Official code implementation of Slow Perception: Let's Perceive Geometric Figures Step-by-step
☆131Updated last week
Beckschen / ViTamin
[CVPR 2024] Official implementation of "ViTamin: Designing Scalable Vision Models in the Vision-language Era"
☆207Updated last year
zjysteven / lmms-finetune
A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision,…
☆317Updated 5 months ago
kyegomez / PALI
Democratization of "PaLI: A Jointly-Scaled Multilingual Language-Image Model"
☆92Updated last year
Victorwz / Open-Qwen2VL
[COLM 2025] Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources
☆244Updated 2 months ago
ZhangXJ199 / TinyLLaVA-Video
A Simple Framework of Small-scale LMMs for Video Understanding
☆73Updated last month
kongds / E5-V
E5-V: Universal Embeddings with Multimodal Large Language Models
☆262Updated 7 months ago
baaivision / EVE
EVE Series: Encoder-Free Vision-Language Models from BAAI
☆342Updated 2 weeks ago
FreedomIntelligence / LongLLaVA
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
☆209Updated 7 months ago
2U1 / SmolVLM-Finetune
An open-source implementation for fine-tuning SmolVLM.
☆42Updated 3 months ago
sandy1990418 / Finetune-Qwen2.5-VL
Fine-tuning Qwen2.5-VL for vision-language tasks | Optimized for Vision understanding | LoRA & PEFT support.
☆107Updated 6 months ago
TIGER-AI-Lab / VLM2Vec
This repo contains the code for "VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks" [ICLR 2025]
☆355Updated this week
TIGER-AI-Lab / Pixel-Reasoner
Pixel-Level Reasoning Model trained with RL
☆187Updated last month
mbzuai-oryx / Video-LLaVA
PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models
☆257Updated this week
FudanNLPLAB / MouSi
☆73Updated last year
zzxslp / SoM-LLaVA
[COLM-2024] List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
☆144Updated 11 months ago