mit-han-lab / vila-u
[ICLR 2025] VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
⭐271 · Updated 2 months ago
Alternatives and similar repositories for vila-u:
Users interested in vila-u are comparing it to the libraries listed below.
- [CVPR 2025] 🔥 Official impl. of "TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation". ⭐307 · Updated last month
- EVE Series: Encoder-Free Vision-Language Models from BAAI ⭐320 · Updated last month
- Long Context Transfer from Language to Vision ⭐371 · Updated 3 weeks ago
- A Unified Tokenizer for Visual Generation and Understanding ⭐249 · Updated this week
- [ICLR 2025] Autoregressive Video Generation without Vector Quantization ⭐466 · Updated 2 weeks ago
- Official implementation of the Law of Vision Representation in MLLMs ⭐153 · Updated 4 months ago
- Official repository of "GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing" ⭐214 · Updated 3 weeks ago
- [CVPR 2025] Official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models" ⭐151 · Updated last month
- Video-R1: Reinforcing Video Reasoning in MLLMs [🔥 the first paper to explore R1 for video] ⭐391 · Updated this week
- Official implementation of Unified Reward Model for Multimodal Understanding and Generation. ⭐238 · Updated this week
- A repository for organizing papers, code, and other resources related to unified multimodal models. ⭐171 · Updated last week
- A repository tracking the latest autoregressive visual generation papers. ⭐246 · Updated this week
- LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture ⭐201 · Updated 3 months ago
- [CVPR 2025 Highlight] PAR: Parallelized Autoregressive Visual Generation. https://yuqingwang1029.github.io/PAR-project ⭐146 · Updated 3 weeks ago
- LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models ⭐124 · Updated 11 months ago
- Official repo and evaluation implementation of VSI-Bench ⭐449 · Updated last month
- [ICLR 2024 Spotlight] DreamLLM: Synergistic Multimodal Comprehension and Creation ⭐430 · Updated 4 months ago
- Empowering Unified MLLM with Multi-granular Visual Generation ⭐119 · Updated 3 months ago
- [ECCV 2024] ShareGPT4V: Improving Large Multi-modal Models with Better Captions ⭐210 · Updated 9 months ago
- Adaptive Caching for Faster Video Generation with Diffusion Transformers ⭐144 · Updated 5 months ago
- [ECCV 2024 Oral] Code for paper: "An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models" ⭐407 · Updated 3 months ago
- [Survey] Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey ⭐416 · Updated 2 months ago
- Explore the Limits of Omni-modal Pretraining at Scale ⭐97 · Updated 7 months ago
- [CVPR 2025 Highlight] The official CLIP training codebase of Inf-CL: "Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss" ⭐240 · Updated 3 months ago
- Official repository for VisionZip (CVPR 2025) ⭐265 · Updated last month
- [ICLR 2025] OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation ⭐272 · Updated last month
- [ICLR 2025] Diffusion Feedback Helps CLIP See Better ⭐272 · Updated 2 months ago
- [NeurIPS 2024] This repo contains evaluation code for the paper "Are We on the Right Way for Evaluating Large Vision-Language Models" ⭐174 · Updated 6 months ago
- Official Implementation for our NeurIPS 2024 paper, "Don't Look Twice: Run-Length Tokenization for Faster Video Transformers". ⭐207 · Updated 2 weeks ago