mit-han-lab / vila-u
[ICLR 2025] VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
⭐ 227 · Updated 3 weeks ago
Alternatives and similar repositories for vila-u:
Users interested in vila-u are comparing it to the repositories listed below.
- 🔥 Official impl. of "TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation". ⭐ 253 · Updated last month
- EVE Series: Encoder-Free Vision-Language Models from BAAI ⭐ 295 · Updated last week
- Official implementation of the Law of Vision Representation in MLLMs ⭐ 149 · Updated 3 months ago
- Long Context Transfer from Language to Vision ⭐ 360 · Updated 3 months ago
- Adaptive Caching for Faster Video Generation with Diffusion Transformers ⭐ 142 · Updated 3 months ago
- Empowering Unified MLLM with Multi-granular Visual Generation ⭐ 117 · Updated last month
- Collection of awesome generation acceleration resources. ⭐ 139 · Updated this week
- LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models ⭐ 116 · Updated 9 months ago
- LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture ⭐ 189 · Updated last month
- A repository tracking the latest autoregressive visual generation papers. ⭐ 139 · Updated last week
- [ICLR 2025] OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation ⭐ 243 · Updated last week
- A repository organizing papers, code, and other resources related to unified multimodal models. ⭐ 374 · Updated last month
- [ICLR 2025] Autoregressive Video Generation without Vector Quantization ⭐ 385 · Updated this week
- HART: Efficient Visual Generation with Hybrid Autoregressive Transformer ⭐ 418 · Updated 4 months ago
- [ECCV 2024 Oral] Code for paper: An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Langua… ⭐ 364 · Updated last month
- The official implementation for "MonoFormer: One Transformer for Both Diffusion and Autoregression" ⭐ 84 · Updated 4 months ago
- A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models! ⭐ 125 · Updated last year
- Explore the Limits of Omni-modal Pretraining at Scale ⭐ 96 · Updated 5 months ago
- A collection of papers on autoregressive models in vision. ⭐ 406 · Updated this week
- Implements VAR+CLIP for text-to-image (T2I) generation ⭐ 119 · Updated 3 weeks ago
- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer ⭐ 210 · Updated 10 months ago
- [ACL 2024 Findings] "TempCompass: Do Video LLMs Really Understand Videos?", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, … ⭐ 101 · Updated 2 weeks ago
- [NeurIPS 2024] This repo contains evaluation code for the paper "Are We on the Right Way for Evaluating Large Vision-Language Models" ⭐ 165 · Updated 4 months ago
- SpeeD: A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training ⭐ 174 · Updated 3 weeks ago
- Official Implementation for our NeurIPS 2024 paper, "Don't Look Twice: Run-Length Tokenization for Faster Video Transformers". ⭐ 191 · Updated 2 months ago
- [NeurIPS 2024] Official code for HourVideo: 1-Hour Video Language Understanding ⭐ 62 · Updated last month
- ✨✨ Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models ⭐ 151 · Updated last month