MattUnderscoreZhang / videopoet_replication
A replication of Google's VideoPoet model
☆11, updated 9 months ago
Related projects
Alternatives and complementary repositories for videopoet_replication
- Explore the Limits of Omni-modal Pretraining at Scale ☆89, updated 2 months ago
- ☆127, updated 3 weeks ago
- ☆104, updated 4 months ago
- Official repo for StableLLAVA ☆91, updated 11 months ago
- 📖 This is a repository for organizing papers, codes and other resources related to unified multimodal models. ☆217, updated this week
- My implementation of "Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution" ☆185, updated 2 weeks ago
- ✨✨ Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models ☆140, updated 2 weeks ago
- Video dataset dedicated to portrait-mode video recognition. ☆38, updated 7 months ago
- ☆105, updated 3 months ago
- 🔥 ImageFolder: Autoregressive Image Generation with Folded Tokens ☆59, updated last week
- ☆131, updated 11 months ago
- [CVPR 2024] CapsFusion: Rethinking Image-Text Data at Scale ☆194, updated 8 months ago
- ☆35, updated 5 months ago
- Harnessing 1.4M GPT4V-synthesized Data for A Lite Vision-Language Model ☆246, updated 5 months ago
- Official implementation of the Law of Vision Representation in MLLMs ☆134, updated last week
- [TMLR] Public code repo for paper "A Single Transformer for Scalable Vision-Language Modeling" ☆116, updated last week
- [NeurIPS 2024] Classification Done Right for Vision-Language Pre-Training ☆140, updated 2 weeks ago
- Scaling Diffusion Transformers with Mixture of Experts ☆207, updated 2 months ago
- [ECCV 2024] ShareGPT4V: Improving Large Multi-modal Models with Better Captions ☆157, updated 4 months ago
- DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception ☆120, updated last month
- This is a repo to track the latest autoregressive visual generation papers. ☆50, updated this week
- [NeurIPS 2024] Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective ☆41, updated 3 weeks ago
- [NeurIPS'24 Spotlight] EVE: Encoder-Free Vision-Language Models ☆232, updated last month
- Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models ☆53, updated 3 weeks ago
- [NeurIPS 2024] VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models ☆115, updated last month
- ☆120, updated last month
- [NeurIPS 2024] Efficient Multi-modal Models via Stage-wise Visual Context Compression ☆42, updated 3 months ago
- [NeurIPS 2024] CV-VAE: A Compatible Video VAE for Latent Generative Video Models ☆246, updated 3 weeks ago
- Implements VAR+CLIP for image generation ☆78, updated 3 months ago