FoundationVision / OmniTokenizer
[NeurIPS 2024] OmniTokenizer: one model and one weight for image-video joint tokenization.
☆285 Updated 8 months ago
Alternatives and similar repositories for OmniTokenizer:
Users who are interested in OmniTokenizer are comparing it to the libraries listed below.
- Liquid: Language Models are Scalable and Unified Multi-modal Generators ☆288 Updated this week
- A family of versatile and state-of-the-art video tokenizers. ☆354 Updated this week
- Evaluating text-to-image/video/3D models with VQAScore ☆266 Updated last week
- SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation ☆112 Updated 5 months ago
- ☆132 Updated 2 months ago
- Visualization of DiT self attention features ☆175 Updated 7 months ago
- [ICLR 2025] MLLM for On-Demand Spatial-Temporal Understanding at Arbitrary Resolution ☆297 Updated 3 weeks ago
- High-performance Image Tokenizers for VAR and AR ☆226 Updated this week
- [CVPR 2024] Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation ☆110 Updated last year
- [ICLR 2025] BiGR: Harnessing Binary Latent Codes for Image Generation and Improved Visual Representation Capabilities ☆139 Updated 2 months ago
- Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition ☆282 Updated 2 months ago
- [CVPR 2025] 🔥 Official impl. of "TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation". ☆292 Updated 3 weeks ago
- [NeurIPS 2024 D&B Spotlight🔥] ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation ☆191 Updated last month
- Scaling Diffusion Transformers with Mixture of Experts ☆294 Updated 6 months ago
- An official implementation of VideoRoPE: What Makes for Good Video Rotary Position Embedding? ☆112 Updated 2 weeks ago
- Investigating CoT Reasoning in Autoregressive Image Generation ☆559 Updated this week
- [ICLR 2025] Autoregressive Video Generation without Vector Quantization ☆419 Updated this week
- Official implementation of Unified Reward Model for Multimodal Understanding and Generation. ☆214 Updated last week
- Implements VAR+CLIP for text-to-image (T2I) generation ☆129 Updated 2 months ago
- Official implementation of X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models ☆152 Updated 3 months ago
- [ICLR 2025][arXiv:2406.07548] Image and Video Tokenization with Binary Spherical Quantization ☆139 Updated 9 months ago
- The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM". ☆241 Updated 2 months ago
- [CVPR 2025] The code for "VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM" ☆174 Updated 3 weeks ago
- A Unified Tokenizer for Visual Generation and Understanding ☆210 Updated 3 weeks ago
- [ECCV 2024] Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation ☆291 Updated 8 months ago
- (ECCV 2024) Empowering Multimodal Large Language Model as a Powerful Data Generator ☆107 Updated this week
- ✨✨Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy ☆263 Updated this week
- This repository includes the official implementation of our paper "Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generat… ☆143 Updated 3 weeks ago
- ☆119 Updated 8 months ago
- Quick scripts to calculate CLIP text-image similarity ☆220 Updated 4 months ago