mit-han-lab / vila-u
[ICLR 2025] VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
⭐246 · Updated 2 months ago
Alternatives and similar repositories for vila-u:
Users interested in vila-u are comparing it to the repositories listed below.
- EVE Series: Encoder-Free Vision-Language Models from BAAI · ⭐313 · Updated 3 weeks ago
- [CVPR 2025] 🔥 Official impl. of "TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation" · ⭐292 · Updated 2 weeks ago
- Empowering Unified MLLM with Multi-granular Visual Generation · ⭐119 · Updated 2 months ago
- [CVPR 2025] Official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models" · ⭐134 · Updated 3 weeks ago
- Official implementation of the Law of Vision Representation in MLLMs · ⭐151 · Updated 4 months ago
- Official Implementation for our NeurIPS 2024 paper, "Don't Look Twice: Run-Length Tokenization for Faster Video Transformers" · ⭐201 · Updated 3 months ago
- [ICLR 2025] Autoregressive Video Generation without Vector Quantization · ⭐419 · Updated this week
- Official repository of "GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing" · ⭐167 · Updated this week
- A Unified Tokenizer for Visual Generation and Understanding · ⭐210 · Updated 3 weeks ago
- Long Context Transfer from Language to Vision · ⭐368 · Updated last week
- LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models · ⭐119 · Updated 10 months ago
- Adaptive Caching for Faster Video Generation with Diffusion Transformers · ⭐142 · Updated 4 months ago
- [ICLR 2025] OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation · ⭐256 · Updated 3 weeks ago
- ⭐138 · Updated 2 months ago
- My implementation of "Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution" · ⭐224 · Updated last month
- [ICLR 2024 Spotlight] DreamLLM: Synergistic Multimodal Comprehension and Creation · ⭐427 · Updated 3 months ago
- [CVPR 2025] PAR: Parallelized Autoregressive Visual Generation. https://yuqingwang1029.github.io/PAR-project/ · ⭐127 · Updated this week
- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer · ⭐220 · Updated 11 months ago
- ✨✨Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models · ⭐154 · Updated 2 months ago
- [ACL 2024 Findings] "TempCompass: Do Video LLMs Really Understand Videos?", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, … · ⭐107 · Updated last month
- This is a repo to track the latest autoregressive visual generation papers · ⭐169 · Updated this week
- [CVPR 2025] Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models · ⭐165 · Updated this week
- Official repo and evaluation implementation of VSI-Bench · ⭐421 · Updated 3 weeks ago
- 📖 This is a repository for organizing papers, codes and other resources related to unified multimodal models · ⭐415 · Updated last week
- 【NeurIPS 2024】Dense Connector for MLLMs · ⭐157 · Updated 5 months ago
- [ICLR 2025] Diffusion Feedback Helps CLIP See Better · ⭐268 · Updated 2 months ago
- [TMLR] Public code repo for paper "A Single Transformer for Scalable Vision-Language Modeling" · ⭐130 · Updated 4 months ago
- The official implementation for "MonoFormer: One Transformer for Both Diffusion and Autoregression" · ⭐86 · Updated 5 months ago
- [NeurIPS 2024] This repo contains evaluation code for the paper "Are We on the Right Way for Evaluating Large Vision-Language Models" · ⭐168 · Updated 5 months ago
- ⭐111 · Updated 7 months ago