mit-han-lab / vila-u
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
⭐199 · Updated this week
Alternatives and similar repositories for vila-u:
Users interested in vila-u are comparing it to the libraries listed below
- 🔥 Official impl. of "TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation" ⭐222 · Updated 2 weeks ago
- Empowering Unified MLLM with Multi-granular Visual Generation ⭐114 · Updated this week
- Official implementation of the Law of Vision Representation in MLLMs ⭐145 · Updated 2 months ago
- [NeurIPS'24 Spotlight] EVE: Encoder-Free Vision-Language Models ⭐261 · Updated 3 months ago
- Adaptive Caching for Faster Video Generation with Diffusion Transformers ⭐134 · Updated 2 months ago
- A repository for organizing papers, code, and other resources related to unified multimodal models ⭐328 · Updated 3 weeks ago
- Collection of awesome generation acceleration resources ⭐93 · Updated this week
- My implementation of "Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution" ⭐207 · Updated 2 months ago
- ⭐221 · Updated 6 months ago
- ⭐132 · Updated this week
- Official implementation of Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input ⭐61 · Updated 4 months ago
- LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models ⭐109 · Updated 8 months ago
- A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models ⭐123 · Updated last year
- Long Context Transfer from Language to Vision ⭐356 · Updated last month
- DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception ⭐130 · Updated last month
- 🔥 Aurora Series: A more efficient multimodal large language model series for video ⭐62 · Updated 2 months ago
- Explore the Limits of Omni-modal Pretraining at Scale ⭐96 · Updated 4 months ago
- VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation ⭐84 · Updated 4 months ago
- The official implementation of "MonoFormer: One Transformer for Both Diffusion and Autoregression" ⭐80 · Updated 3 months ago
- XQ-GAN: An Open-source Image Tokenization Framework for Autoregressive Generation ⭐178 · Updated last month
- [ICLR 2024 Spotlight] DreamLLM: Synergistic Multimodal Comprehension and Creation ⭐409 · Updated last month
- ⭐128 · Updated last month
- LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture ⭐188 · Updated last week
- The official implementation of PAR: Parallelized Autoregressive Visual Generation. https://epiphqny.github.io/PAR-project/ ⭐106 · Updated 2 weeks ago
- Evaluation code for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive". https://arxiv.or… ⭐112 · Updated 6 months ago
- Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey ⭐274 · Updated this week
- NOVA: Autoregressive Video Generation without Vector Quantization ⭐314 · Updated this week
- [ECCV 2024 🔥] Official implementation of the paper "ST-LLM: Large Language Models Are Effective Temporal Learners" ⭐134 · Updated 4 months ago
- [NeurIPS 2024] Official code for HourVideo: 1-Hour Video Language Understanding ⭐56 · Updated last week
- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer ⭐210 · Updated 9 months ago