allenai / molmo
Code for the Molmo Vision-Language Model
☆431 · Updated 5 months ago
Alternatives and similar repositories for molmo
Users interested in molmo are comparing it to the libraries listed below.
- Python library to evaluate VLMs' robustness across diverse benchmarks ☆207 · Updated this week
- [ICLR2025] LLaVA-HR: High-Resolution Large Language-Vision Assistant ☆238 · Updated 9 months ago
- [ACL 2025 🔥] Rethinking Step-by-step Visual Reasoning in LLMs ☆299 · Updated 2 weeks ago
- [CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts ☆321 · Updated 10 months ago
- Video-R1: Reinforcing Video Reasoning in MLLMs [🔥 the first paper to explore R1 for video] ☆546 · Updated last week
- LLaVA-UHD v2: an MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer ☆379 · Updated last month
- State-of-the-art Image & Video CLIP, Multimodal Large Language Models, and More! ☆1,189 · Updated last week
- A flexible and efficient codebase for training visually-conditioned language models (VLMs) ☆693 · Updated 11 months ago
- Explore the Multimodal "Aha Moment" on a 2B Model ☆589 · Updated 2 months ago
- Quick exploration into fine-tuning Florence-2 ☆316 · Updated 8 months ago
- LLM2CLIP makes the SOTA pretrained CLIP model even more SOTA. ☆520 · Updated 2 months ago
- PyTorch implementation of "V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs" ☆619 · Updated last year
- When do we not need larger vision models? ☆396 · Updated 3 months ago
- An open-source implementation for fine-tuning Molmo-7B-D and Molmo-7B-O by allenai. ☆55 · Updated last month
- An open-source implementation for fine-tuning the Qwen2-VL and Qwen2.5-VL series by Alibaba Cloud. ☆784 · Updated this week
- This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for E… ☆438 · Updated 2 weeks ago
- Long Context Transfer from Language to Vision ☆378 · Updated 2 months ago
- Official repo and evaluation implementation of VSI-Bench ☆492 · Updated 3 months ago
- Compose multimodal datasets 🎹 ☆393 · Updated this week
- [Fully open] [Encoder-free MLLM] Vision as LoRA ☆280 · Updated last week
- VLM Evaluation: benchmark for VLMs, spanning text generation tasks from VQA to captioning ☆112 · Updated 8 months ago
- SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models ☆220 · Updated 8 months ago
- This is the first paper to explore how to effectively use RL for MLLMs and introduce Vision-R1, a reasoning MLLM that leverages cold-sta… ☆579 · Updated 3 weeks ago
- OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning ☆237 · Updated 3 weeks ago
- ☆613 · Updated last year
- (CVPR2024) A benchmark for evaluating Multimodal LLMs using multiple-choice questions. ☆342 · Updated 4 months ago
- [ICLR 2024 & ECCV 2024] The All-Seeing Projects: Towards Panoptic Visual Recognition & Understanding and General Relation Comprehension of … ☆486 · Updated 9 months ago
- Tarsier -- a family of large-scale video-language models designed to generate high-quality video descriptions, together with g… ☆379 · Updated last month
- [COLM-2024] List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs ☆142 · Updated 9 months ago
- Official code for the paper "Mantis: Multi-Image Instruction Tuning" [TMLR2024] ☆215 · Updated 2 months ago