TencentARC / mllm-npu
mllm-npu: training multimodal large language models on Ascend NPUs
☆90 · Updated 4 months ago
Alternatives and similar repositories for mllm-npu:
Users interested in mllm-npu are also comparing it to the libraries listed below:
- Adaptive Caching for Faster Video Generation with Diffusion Transformers ☆134 · Updated 2 months ago
- VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation ☆199 · Updated this week
- A lightweight, highly efficient training framework for accelerating diffusion tasks. ☆44 · Updated 4 months ago
- A parallel VAE that avoids OOM for high-resolution image generation ☆49 · Updated last week
- ☆127 · Updated this week
- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer ☆210 · Updated 9 months ago
- Pruning the VLLMs ☆77 · Updated last month
- 📖 A curated list of Awesome Diffusion Inference Papers with code, covering topics such as Sampling, Caching, and Multi-GPUs. 🎉🎉 ☆168 · Updated this week
- minisora-DiT, a DiT reproduction based on XTuner from the open-source community MiniSora ☆38 · Updated 9 months ago
- MuLan: Adapting Multilingual Diffusion Models for 110+ Languages (adds multilingual support to any diffusion model without additional training) ☆129 · Updated 7 months ago
- Official implementation of Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input ☆61 · Updated 4 months ago
- [NeurIPS 2024] Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching ☆91 · Updated 6 months ago
- 📚 Collection of awesome generation-acceleration resources. ☆93 · Updated this week
- Valley is a cutting-edge multimodal large model designed to handle a variety of tasks involving text, image, and video data. ☆178 · Updated this week
- Accelerating Diffusion Transformers with Token-wise Feature Caching ☆46 · Updated last week
- [CVPR 2024] CapsFusion: Rethinking Image-Text Data at Scale ☆200 · Updated 10 months ago
- A Framework for Decoupling and Assessing the Capabilities of VLMs ☆40 · Updated 6 months ago
- ☆132 · Updated this week
- ☆162 · Updated last month
- LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture ☆188 · Updated last week
- My implementation of "Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution" ☆207 · Updated 2 months ago
- ☆107 · Updated 5 months ago
- Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models ☆127 · Updated 7 months ago
- Official code for the paper "[CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster." ☆44 · Updated last month
- ☆159 · Updated 6 months ago
- Code and data for "VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks" ☆109 · Updated 2 weeks ago
- Official code of "Virgo: A Preliminary Exploration on Reproducing o1-like MLLM" ☆68 · Updated this week
- [ICCV2023] TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance ☆76 · Updated 6 months ago
- [NeurIPS'24 Spotlight] EVE: Encoder-Free Vision-Language Models ☆261 · Updated 3 months ago