HaozheZhao / MENTOR
☆30Updated 4 months ago
Alternatives and similar repositories for MENTOR
Users that are interested in MENTOR are comparing it to the libraries listed below
- [NeurIPS 2025] HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation☆71Updated 2 months ago
- ☆132Updated last month
- A unified framework for controllable caption generation across images, videos, and audio. Supports multi-modal inputs and customizable ca…☆52Updated 3 months ago
- [Preprint] GMem: A Modular Approach for Ultra-Efficient Generative Models☆40Updated 8 months ago
- The official PyTorch implementation for Improving Long-Text Alignment for Text-to-Image Diffusion Models (LongAlign)☆80Updated 6 months ago
- Explore how to get VQ-VAE models efficiently!☆62Updated 3 months ago
- AliTok: Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model☆49Updated last month
- On Path to Multimodal Generalist: General-Level and General-Bench☆19Updated 4 months ago
- ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer☆38Updated 10 months ago
- [ICML 2025] This is the official PyTorch implementation of "ZipAR: Accelerating Auto-regressive Image Generation through Spatial Locality…☆53Updated 7 months ago
- Codebase for the paper "Elucidating the design space of language models for image generation"☆46Updated last year
- [NeurIPS 2024] Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective☆72Updated last year
- ☆63Updated 6 months ago
- LMM solved catastrophic forgetting (AAAI 2025)☆44Updated 7 months ago
- [CVPR 2025 AI4CC Workshop] Official Implementation of HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editin…☆35Updated 6 months ago
- Text-Only Data Synthesis for Vision Language Model Training☆22Updated 5 months ago
- LLaVA combined with the Magvit image tokenizer, training an MLLM without a vision encoder. Unifying image understanding and generation.☆37Updated last year
- ∞-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation☆19Updated 9 months ago
- [ECCV'24 Oral] PiTe: Pixel-Temporal Alignment for Large Video-Language Model☆17Updated 9 months ago
- Official implementation of Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents (NeurIPS 2025)☆43Updated last month
- UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model☆22Updated last year
- ☆43Updated 5 months ago
- Official implementation of "UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing"☆96Updated last week
- ☆73Updated last month
- [ICLR 2025] Source code for paper "A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegr…☆78Updated 11 months ago
- ☆35Updated 6 months ago
- ☆62Updated 4 months ago
- [NeurIPS 2024] The official implementation of the research paper "FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Atten…☆60Updated 4 months ago
- ☆37Updated 2 months ago
- CoDi: Subject-Consistent and Pose-Diverse Text-to-Image Generation☆36Updated 3 months ago