MattUnderscoreZhang / videopoet_replication
A replication of Google's VideoPoet model
☆11 Updated last year
Alternatives and similar repositories for videopoet_replication
Users who are interested in videopoet_replication are comparing it to the repositories listed below:
- [ICCV 2025] Explore the Limits of Omni-modal Pretraining at Scale ☆121 Updated last year
- My implementation of "Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution" ☆270 Updated last month
- [CVPR 2024] CapsFusion: Rethinking Image-Text Data at Scale ☆212 Updated last year
- ☆140 Updated last year
- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer ☆247 Updated last year
- MoVQGAN - model for the image encoding and reconstruction ☆257 Updated 2 years ago
- LMM solved catastrophic forgetting, AAAI2025 ☆44 Updated 8 months ago
- ☆156 Updated 11 months ago
- Harnessing 1.4M GPT4V-synthesized Data for A Lite Vision-Language Model ☆277 Updated last year
- [CVPR 2025 Highlight] The official CLIP training codebase of Inf-CL: "Breaking the Memory Barrier: Near Infinite Batch Size Scaling for C… ☆274 Updated 11 months ago
- [TMLR] Public code repo for paper "A Single Transformer for Scalable Vision-Language Modeling" ☆147 Updated last year
- ☆65 Updated 7 months ago
- Official implementation of our paper "Finetuned Multimodal Language Models are High-Quality Image-Text Data Filters". ☆68 Updated 8 months ago
- Keras implementation of Finite Scalar Quantization ☆83 Updated 2 years ago
- [COLM'25] Official implementation of the Law of Vision Representation in MLLMs ☆171 Updated 2 months ago
- VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks ☆390 Updated last year
- LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture ☆211 Updated 11 months ago
- ☆87 Updated last year
- LLaVA combined with the Magvit image tokenizer, training an MLLM without a Vision Encoder. Unifying image understanding and generation. ☆39 Updated last year
- [NeurIPS 2025] HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation ☆73 Updated 3 months ago
- EVE Series: Encoder-Free Vision-Language Models from BAAI ☆361 Updated 4 months ago
- A list for Text-to-Video, Image-to-Video works ☆250 Updated 6 months ago
- LaVIT: Empower the Large Language Model to Understand and Generate Visual Content ☆599 Updated last year
- [NeurIPS 2024] VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models ☆169 Updated last year
- [CVPR 2024] Official implementation of "ViTamin: Designing Scalable Vision Models in the Vision-language Era" ☆211 Updated last year
- Scaling Diffusion Transformers with Mixture of Experts ☆410 Updated last year
- Official repo for StableLLAVA ☆95 Updated last year
- https://www.shoufachen.com/Awesome-Diffusion-Transformers/ ☆152 Updated last year
- [NeurIPS 2024] Classification Done Right for Vision-Language Pre-Training ☆220 Updated 9 months ago
- ☆133 Updated last year