allenai / molmo
Code for the Molmo Vision-Language Model
☆377 Updated 4 months ago
Alternatives and similar repositories for molmo:
Users interested in molmo are comparing it to the libraries listed below.
- State-of-the-art Image & Video CLIP, Multimodal Large Language Models, and More! ☆548 Updated this week
- LLM2CLIP makes SOTA pretrained CLIP models even more SOTA. ☆506 Updated last month
- Python library to evaluate VLMs' robustness across diverse benchmarks ☆201 Updated this week
- Rethinking Step-by-step Visual Reasoning in LLMs ☆289 Updated 3 months ago
- [ICLR2025] LLaVA-HR: High-Resolution Large Language-Vision Assistant ☆236 Updated 8 months ago
- EVE Series: Encoder-Free Vision-Language Models from BAAI ☆322 Updated last month
- LLaVA-UHD v2: an MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer ☆374 Updated this week
- [CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts ☆319 Updated 9 months ago
- PyTorch Implementation of "V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs" ☆594 Updated last year
- When do we not need larger vision models? ☆388 Updated 2 months ago
- Explore the Multimodal "Aha Moment" on a 2B Model ☆577 Updated last month
- Compose multimodal datasets 🎹 ☆351 Updated this week
- [ICLR 2024 & ECCV 2024] The All-Seeing Projects: Towards Panoptic Visual Recognition & Understanding and General Relation Comprehension of … ☆481 Updated 8 months ago
- A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision,… ☆290 Updated 2 months ago
- [CVPR2025 Highlight] Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models ☆184 Updated 3 weeks ago
- ☆610 Updated last year
- [NeurIPS'24 Spotlight] Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought … ☆298 Updated 4 months ago
- A flexible and efficient codebase for training visually-conditioned language models (VLMs) ☆652 Updated 9 months ago
- Long Context Transfer from Language to Vision ☆373 Updated last month
- ☆328 Updated last year
- An open-source implementation for fine-tuning the Qwen2-VL and Qwen2.5-VL series by Alibaba Cloud ☆648 Updated this week
- Quick exploration into fine-tuning Florence-2 ☆308 Updated 7 months ago
- ☆381 Updated 4 months ago
- Video-R1: Reinforcing Video Reasoning in MLLMs [🔥 the first paper to explore R1 for video] ☆469 Updated this week
- Tarsier: a family of large-scale video-language models designed to generate high-quality video descriptions, together with g… ☆356 Updated this week
- SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models ☆216 Updated 7 months ago
- VLM Evaluation: Benchmark for VLMs, spanning text-generation tasks from VQA to captioning ☆108 Updated 7 months ago
- CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts ☆147 Updated 10 months ago
- [Fully open] [Encoder-free MLLM] Vision as LoRA ☆138 Updated last week
- [CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses tha… ☆866 Updated 5 months ago