apple / ml-aim

This repository provides the code and model checkpoints for AIMv1 and AIMv2 research projects.

☆1,387

Alternatives and similar repositories for ml-aim

Users who are interested in ml-aim compare it to the libraries listed below.

facebookresearch / MetaCLIP
NeurIPS 2025 Spotlight; ICLR2024 Spotlight; CVPR 2024; EMNLP 2024
☆1,750 Updated this week
apple / ml-4m
4M: Massively Multimodal Masked Modeling
☆1,773 Updated 6 months ago
cambrian-mllm / cambrian
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
☆1,974 Updated 3 weeks ago
facebookresearch / chameleon
Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR.
☆2,068 Updated last year
facebookresearch / perception_models
State-of-the-art Image & Video CLIP, Multimodal Large Language Models, and More!
☆1,776 Updated 2 months ago
NVIDIA / Cosmos-Tokenizer
A suite of image and video neural tokenizers
☆1,686 Updated 9 months ago
apple / ml-mobileclip
This repository contains the official implementation of the research papers, "MobileCLIP" CVPR 2024 and "MobileCLIP2" TMLR August 2025
☆1,324 Updated last month
NVlabs / RADIO
Official repository for "AM-RADIO: Reduce All Domains Into One"
☆1,403 Updated last week
facebookresearch / hiera
Hiera: A fast, powerful, and simple hierarchical vision transformer.
☆1,041 Updated last year
FoundationVision / LlamaGen
Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation
☆1,904 Updated last year
microsoft / LLM2CLIP
LLM2CLIP makes SOTA pretrained CLIP model more SOTA ever.
☆567 Updated this week
allenai / molmo
Code for the Molmo Vision-Language Model
☆814 Updated 11 months ago
google-research / big_vision
Official codebase used to develop Vision Transformer, SigLIP, MLP-Mixer, LiT and more.
☆3,247 Updated 6 months ago
bfshi / scaling_on_scales
When do we not need larger vision models?
☆412 Updated 9 months ago
OpenGVLab / VisionLLM
VisionLLM Series
☆1,128 Updated 9 months ago
penghao-wu / vstar
PyTorch Implementation of "V* : Guided Visual Search as a Core Mechanism in Multimodal LLMs"
☆681 Updated last year
mbzuai-oryx / groundingLMM
[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses tha…
☆930 Updated 3 months ago
allenai / unified-io-2
☆632 Updated last year
LLaVA-VL / LLaVA-Plus-Codebase
LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills
☆762 Updated last year
baaivision / Emu
Emu Series: Generative Multimodal Models from BAAI
☆1,761 Updated last year
mlfoundations / datacomp
DataComp: In search of the next generation of multimodal datasets
☆750 Updated 7 months ago
baaivision / Emu3
Next-Token Prediction is All You Need
☆2,257 Updated 2 weeks ago
microsoft / Samba
[ICLR 2025] Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling
☆932 Updated 2 weeks ago
apple / ml-veclip
The official repo for the paper "VeCLIP: Improving CLIP Training via Visual-enriched Captions"
☆248 Updated 10 months ago
PKU-YuanGroup / MoE-LLaVA
【TMM 2025🔥】 Mixture-of-Experts for Large Vision-Language Models
☆2,278 Updated 4 months ago
IDEA-Research / Grounding-DINO-1.5-API
Grounding DINO 1.5: IDEA Research's Most Capable Open-World Object Detection Model Series
☆1,061 Updated 10 months ago
BAAI-DCAI / Bunny
A family of lightweight multimodal models.
☆1,047 Updated last year
facebookresearch / jepa
PyTorch code and models for V-JEPA self-supervised learning from video.
☆3,289 Updated 9 months ago
LLaVA-VL / LLaVA-Interactive-Demo
LLaVA-Interactive-Demo
☆379 Updated last year
ytongbai / LVM
☆1,839 Updated last year