MattUnderscoreZhang / videopoet_replication
A replication of Google's VideoPoet model
☆12 Updated 11 months ago
Alternatives and similar repositories for videopoet_replication:
Users who are interested in videopoet_replication compare it to the repositories listed below
- ☆112 Updated 6 months ago
- ☆131 Updated this week
- This is a repo to track the latest autoregressive visual generation papers. ☆105 Updated 3 weeks ago
- My implementation of "Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution" ☆207 Updated this week
- MoVQGAN - model for the image encoding and reconstruction ☆212 Updated last year
- [TMLR] Public code repo for paper "A Single Transformer for Scalable Vision-Language Modeling" ☆126 Updated 2 months ago
- 📚 Collection of awesome generation acceleration resources. ☆93 Updated last week
- ☆58 Updated 3 months ago
- Scaling Diffusion Transformers with Mixture of Experts ☆243 Updated 4 months ago
- Explore the Limits of Omni-modal Pretraining at Scale ☆96 Updated 4 months ago
- VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation ☆203 Updated last week
- [NeurIPS 2024] Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching ☆91 Updated 6 months ago
- [CVPR 2024] CapsFusion: Rethinking Image-Text Data at Scale ☆200 Updated 10 months ago
- 🔥 Official impl. of "TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation". ☆226 Updated 3 weeks ago
- ☆128 Updated last month
- [NeurIPS 2024] Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective ☆57 Updated 2 months ago
- XQ-GAN🚀: An Open-source Image Tokenization Framework for Autoregressive Generation ☆179 Updated last month
- [NeurIPS'24 Spotlight] EVE: Encoder-Free Vision-Language Models ☆264 Updated 3 months ago
- 📖 This is a repository for organizing papers, codes and other resources related to unified multimodal models. ☆333 Updated this week
- STAR: Scale-wise Text-to-image generation via Auto-Regressive representations ☆133 Updated 7 months ago
- LVBench: An Extreme Long Video Understanding Benchmark ☆74 Updated 4 months ago
- [MM2024, oral] "Self-Supervised Visual Preference Alignment" https://arxiv.org/abs/2404.10501 ☆46 Updated 5 months ago
- LMM which strictly superset LLM embedded ☆37 Updated 2 months ago
- The official CLIP training codebase of Inf-CL: "Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss". A su… ☆218 Updated this week
- Matryoshka Multimodal Models ☆90 Updated 2 months ago
- Official repo for StableLLAVA ☆94 Updated last year
- ✨✨Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models ☆147 Updated 3 weeks ago
- The official implementation of Latte: Latent Diffusion Transformer for Video Generation. ☆32 Updated 10 months ago
- ☆132 Updated last year
- ☆98 Updated 6 months ago