mit-han-lab / vila-u
[ICLR 2025] VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
⭐ 227 · Updated 3 weeks ago
Alternatives and similar repositories for vila-u:
Users interested in vila-u are comparing it to the repositories listed below.
- 🔥 Official impl. of "TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation". ⭐ 253 · Updated last month
- EVE Series: Encoder-Free Vision-Language Models from BAAI ⭐ 295 · Updated last week
- Official implementation of the Law of Vision Representation in MLLMs ⭐ 149 · Updated 3 months ago
- Long Context Transfer from Language to Vision ⭐ 360 · Updated 3 months ago
- Adaptive Caching for Faster Video Generation with Diffusion Transformers ⭐ 142 · Updated 3 months ago
- Empowering Unified MLLM with Multi-granular Visual Generation ⭐ 117 · Updated last month
- Collection of awesome generation acceleration resources. ⭐ 139 · Updated this week
- LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models ⭐ 116 · Updated 9 months ago
- LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture ⭐ 189 · Updated last month
- A repository tracking the latest autoregressive visual generation papers. ⭐ 139 · Updated last week
- [ICLR 2025] OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation ⭐ 243 · Updated last week
- A repository organizing papers, code, and other resources related to unified multimodal models. ⭐ 374 · Updated last month
- [ICLR 2025] Autoregressive Video Generation without Vector Quantization ⭐ 385 · Updated this week
- HART: Efficient Visual Generation with Hybrid Autoregressive Transformer ⭐ 418 · Updated 4 months ago
- [ECCV 2024 Oral] Code for paper: An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Langua… ⭐ 364 · Updated last month
- The official implementation for "MonoFormer: One Transformer for Both Diffusion and Autoregression" ⭐ 84 · Updated 4 months ago
- A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models! ⭐ 125 · Updated last year
- Explore the Limits of Omni-modal Pretraining at Scale ⭐ 96 · Updated 5 months ago
- A collection of papers on autoregressive models in vision. ⭐ 406 · Updated this week
- Implements VAR+CLIP for text-to-image (T2I) generation ⭐ 119 · Updated 3 weeks ago
- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer ⭐ 210 · Updated 10 months ago
- [ACL 2024 Findings] "TempCompass: Do Video LLMs Really Understand Videos?", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, … ⭐ 101 · Updated 2 weeks ago
- [NeurIPS 2024] This repo contains evaluation code for the paper "Are We on the Right Way for Evaluating Large Vision-Language Models" ⭐ 165 · Updated 4 months ago
- SpeeD: A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training ⭐ 174 · Updated 3 weeks ago
- Official Implementation for our NeurIPS 2024 paper, "Don't Look Twice: Run-Length Tokenization for Faster Video Transformers". ⭐ 191 · Updated 2 months ago
- [NeurIPS 2024] Official code for HourVideo: 1-Hour Video Language Understanding ⭐ 62 · Updated last month
- ✨✨ Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models ⭐ 151 · Updated last month