allenai / molmo
Code for the Molmo Vision-Language Model
☆292Updated 2 months ago
Alternatives and similar repositories for molmo:
Users interested in molmo are comparing it to the libraries listed below.
- LLM2CLIP makes a SOTA pretrained CLIP model even more SOTA.☆470Updated last month
- Python library to evaluate VLMs' robustness across diverse benchmarks☆190Updated last month
- LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer☆366Updated last month
- When do we not need larger vision models?☆368Updated last week
- [ICLR2025] LLaVA-HR: High-Resolution Large Language-Vision Assistant☆229Updated 6 months ago
- [COLM-2024] List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs☆134Updated 5 months ago
- Long Context Transfer from Language to Vision☆360Updated 3 months ago
- [CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts☆313Updated 7 months ago
- Official code for the paper "Mantis: Multi-Image Instruction Tuning" [TMLR2024]☆199Updated this week
- [ICLR 2024 & ECCV 2024] The All-Seeing Projects: Towards Panoptic Visual Recognition&Understanding and General Relation Comprehension of …☆476Updated 6 months ago
- The official CLIP training codebase of Inf-CL: "Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss". A su…☆223Updated last month
- Code for ChatRex: Taming Multimodal LLM for Joint Perception and Understanding☆151Updated 3 weeks ago
- ☆308Updated last year
- PyTorch code for the paper "From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models"☆192Updated last month
- Harnessing 1.4M GPT4V-synthesized Data for A Lite Vision-Language Model☆255Updated 7 months ago
- A flexible and efficient codebase for training visually-conditioned language models (VLMs)☆577Updated 7 months ago
- Rethinking Step-by-step Visual Reasoning in LLMs☆247Updated 3 weeks ago
- ☆599Updated last year
- CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts☆141Updated 8 months ago
- EVE Series: Encoder-Free Vision-Language Models from BAAI☆295Updated last week
- The official repo for the paper "VeCLIP: Improving CLIP Training via Visual-enriched Captions"☆239Updated 3 weeks ago
- [ICML'24 Oral] "MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions"☆162Updated 3 months ago
- This repo contains the code for "VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks" [ICLR25]☆142Updated last week
- Famous Vision Language Models and Their Architectures☆646Updated last week
- A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision,…☆246Updated 2 weeks ago
- VLM Evaluation: Benchmark for VLMs, spanning text generation tasks from VQA to Captioning☆101Updated 5 months ago
- OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text☆309Updated 3 months ago
- This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for E…☆391Updated last month
- Official implementation of the Law of Vision Representation in MLLMs☆149Updated 3 months ago
- [NeurIPS'24 Spotlight] Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought …☆237Updated last month