woct0rdho / transformers-qwen3-moe-fused
Fused Qwen3 MoE layer for faster training, compatible with HF Transformers, LoRA, 4-bit quantization, and Unsloth
☆212 · Updated 3 weeks ago
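For context on what "fused" buys here: the stock HF Transformers Qwen3-MoE block loops over experts in Python, launching many small matmuls for whichever tokens each expert received. A fused layer instead groups the routed tokens so each expert's work becomes one contiguous matmul (and, in a real kernel, the per-expert matmuls collapse into a single grouped GEMM on the GPU). The sketch below illustrates only that grouping idea in plain PyTorch; it is not this repository's API or implementation, and the single-matrix "experts" are toy stand-ins for the real gated MLPs.

```python
import torch

def grouped_moe_forward(x, w_experts, router_logits, top_k=2):
    """Illustrative grouped MoE forward pass (not this repo's API).

    x:             (num_tokens, hidden)
    w_experts:     (num_experts, hidden, hidden) -- toy one-matmul "experts"
    router_logits: (num_tokens, num_experts)
    """
    probs = torch.softmax(router_logits, dim=-1)
    weights, expert_idx = torch.topk(probs, top_k)        # (tokens, top_k)
    flat_idx = expert_idx.reshape(-1)                     # one row per (token, slot)
    order = torch.argsort(flat_idx)                       # group rows by expert
    x_rep = x.repeat_interleave(top_k, dim=0)[order]      # token copies, sorted

    # Each expert now owns one contiguous slice -> one matmul per expert,
    # instead of a gather/scatter per expert as in the naive loop.
    out_sorted = torch.empty_like(x_rep)
    counts = torch.bincount(flat_idx, minlength=w_experts.shape[0])
    start = 0
    for e, n in enumerate(counts.tolist()):
        if n:
            out_sorted[start:start + n] = x_rep[start:start + n] @ w_experts[e]
        start += n

    out = torch.empty_like(x_rep)
    out[order] = out_sorted                               # undo the sort
    out = out * weights.reshape(-1, 1)                    # apply routing weights
    return out.reshape(-1, top_k, x.shape[-1]).sum(dim=1) # combine top-k slots

# Toy usage: 8 tokens, hidden size 16, 4 experts.
x = torch.randn(8, 16)
w = torch.randn(4, 16, 16) / 16 ** 0.5
y = grouped_moe_forward(x, w, torch.randn(8, 4))          # -> (8, 16)
```

Sorting the token copies by expert index is what makes each expert's slice contiguous; the scatter through `order` undoes the sort before the top-k outputs are weighted and summed.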
Alternatives and similar repositories for transformers-qwen3-moe-fused
Users interested in transformers-qwen3-moe-fused are comparing it to the libraries listed below.
- A repository aimed at pruning DeepSeek V3, R1, and R1-Zero to a usable size ☆79 · Updated 3 months ago
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs ☆196 · Updated 2 months ago
- [ICML 2025] TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation ☆118 · Updated 6 months ago
- ☆61 · Updated 6 months ago
- ☆300 · Updated 6 months ago
- ☆85 · Updated 8 months ago
- A collection of tricks and tools to speed up transformer models ☆189 · Updated last month
- ☆66 · Updated 8 months ago
- Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models ☆137 · Updated last year
- ☆91 · Updated 6 months ago
- Lightweight toolkit to train and fine-tune 1.58-bit language models ☆100 · Updated 6 months ago
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆206 · Updated last month
- Model compression toolkit engineered for enhanced usability, comprehensiveness, and efficiency ☆212 · Updated last week
- Chain of Experts (CoE) enables communication between experts within Mixture-of-Experts (MoE) models ☆223 · Updated last month
- Patches for Hugging Face Transformers to save memory ☆33 · Updated 6 months ago
- ☆148 · Updated last year
- Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint? ☆118 · Updated last year
- A pipeline for LLM knowledge distillation ☆110 · Updated 8 months ago
- [NeurIPS'25 Oral] Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3) ☆153 · Updated last week
- ☆98 · Updated 3 months ago
- ☆73 · Updated 6 months ago
- RWKV-7: Surpassing GPT ☆101 · Updated last year
- A minimal PyTorch re-implementation of Qwen3 VL with a fancy CLI ☆256 · Updated this week
- KV cache compression for high-throughput LLM inference ☆145 · Updated 10 months ago
- Parallel Scaling Law for Language Models: Beyond Parameter and Inference-Time Scaling ☆456 · Updated 6 months ago
- FuseAI Project ☆87 · Updated 10 months ago
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆316 · Updated last week
- Minimal GRPO implementation from scratch ☆100 · Updated 8 months ago
- [ICML 2025] From Low Rank Gradient Subspace Stabilization to Low-Rank Weights: Observations, Theories and Applications ☆51 · Updated last month
- Tina: Tiny Reasoning Models via LoRA ☆309 · Updated 2 months ago