allenai / molmo
Code for the Molmo Vision-Language Model
☆413 · Updated 5 months ago
Alternatives and similar repositories for molmo
Users who are interested in molmo are comparing it to the repositories listed below.
- Rethinking Step-by-step Visual Reasoning in LLMs ☆293 · Updated 3 months ago
- [ICLR2025] LLaVA-HR: High-Resolution Large Language-Vision Assistant ☆237 · Updated 9 months ago
- LLaVA-UHD v2: an MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer ☆376 · Updated 3 weeks ago
- Python library to evaluate VLMs' robustness across diverse benchmarks ☆205 · Updated this week
- Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving stat… ☆529 · Updated this week
- Explore the Multimodal “Aha Moment” on a 2B Model ☆586 · Updated last month
- State-of-the-art Image & Video CLIP, Multimodal Large Language Models, and More! ☆1,071 · Updated last week
- This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for E… ☆424 · Updated last month
- OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning ☆214 · Updated this week
- Video-R1: Reinforcing Video Reasoning in MLLMs [🔥 the first paper to explore R1 for video] ☆515 · Updated this week
- LLM2CLIP makes SOTA pretrained CLIP models even more capable ☆513 · Updated last month
- A flexible and efficient codebase for training visually-conditioned language models (VLMs) ☆675 · Updated 10 months ago
- [CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts ☆321 · Updated 10 months ago
- E5-V: Universal Embeddings with Multimodal Large Language Models ☆248 · Updated 4 months ago
- A fork to add multimodal model training to open-r1 ☆1,255 · Updated 3 months ago
- The first paper to explore how to effectively use RL for MLLMs, introducing Vision-R1, a reasoning MLLM that leverages cold-sta… ☆559 · Updated last week
- Compose multimodal datasets 🎹 ☆371 · Updated 3 weeks ago
- ☆609 · Updated last year
- Long Context Transfer from Language to Vision ☆374 · Updated last month
- Official implementation of the paper "SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training" ☆275 · Updated 2 weeks ago
- [ECCV 2024 Oral] Code for the paper "An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Langua… ☆425 · Updated 4 months ago
- When do we not need larger vision models? ☆392 · Updated 3 months ago
- Official repo and evaluation implementation of VSI-Bench ☆481 · Updated 2 months ago
- MM-EUREKA: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning ☆597 · Updated last week
- R1-onevision, a visual language model capable of deep CoT reasoning ☆515 · Updated last month
- [NeurIPS'24 Spotlight] Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought … ☆311 · Updated 4 months ago
- Code for ChatRex: Taming Multimodal LLM for Joint Perception and Understanding ☆185 · Updated 3 months ago
- [AAAI-25] Cobra: Extending Mamba to Multi-modal Large Language Model for Efficient Inference ☆276 · Updated 4 months ago
- [CVPR2025 Highlight] Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models ☆189 · Updated last month
- EVE Series: Encoder-Free Vision-Language Models from BAAI ☆326 · Updated 2 months ago