woct0rdho / transformers-qwen3-moe-fused
Fused Qwen3 MoE layer for faster training, compatible with HF Transformers, LoRA, 4-bit quantization, and Unsloth
☆217 · Updated last month
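The headline feature is a fused MoE layer that is meant to drop into the standard HF Transformers / PEFT workflow. Below is a minimal sketch of how such a layer would typically be combined with 4-bit quantization and LoRA; the `patch_qwen3_moe_fused` entry point and the `qwen3_moe_fused` module name are hypothetical placeholders for illustration (check the repository for the actual API), and the checkpoint name is just one example of a Qwen3 MoE model.

```python
# Sketch: fine-tuning a Qwen3 MoE checkpoint with 4-bit quantization and LoRA.
# The fused-layer patch call is a hypothetical placeholder, not the repo's confirmed API.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Hypothetical: swap the stock per-expert MLPs for the fused implementation
# before loading, so the model is instantiated with the fused module.
# from qwen3_moe_fused import patch_qwen3_moe_fused
# patch_qwen3_moe_fused()

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-30B-A3B",            # any Qwen3 MoE checkpoint
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```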
Alternatives and similar repositories for transformers-qwen3-moe-fused
Users interested in transformers-qwen3-moe-fused are comparing it to the libraries listed below.
- A repository aimed at pruning DeepSeek V3, R1, and R1-Zero to a usable size ☆81 · Updated 3 months ago
- [ICML 2025] TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation ☆118 · Updated 7 months ago
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs ☆198 · Updated 3 weeks ago
- ☆93 · Updated 7 months ago
- Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models ☆137 · Updated last year
- ☆63 · Updated 7 months ago
- Lightweight toolkit to train and fine-tune 1.58-bit language models ☆103 · Updated 7 months ago
- A collection of tricks and tools to speed up transformer models ☆193 · Updated last week
- ☆374 · Updated last week
- ☆85 · Updated 8 months ago
- Cookbook of SGLang recipes ☆32 · Updated last week
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆212 · Updated 2 months ago
- ☆66 · Updated 9 months ago
- Nano repo for RL training of LLMs ☆70 · Updated last month
- A highly capable 2.4B lightweight LLM using only 1T tokens of pre-training data, with all details released ☆222 · Updated 5 months ago
- RWKV-7: Surpassing GPT ☆101 · Updated last year
- Chain of Experts (CoE) enables communication between experts within Mixture-of-Experts (MoE) models ☆226 · Updated last month
- Minimal GRPO implementation from scratch ☆100 · Updated 9 months ago
- A pipeline for LLM knowledge distillation ☆111 · Updated 8 months ago
- ☆148 · Updated last year
- Repository for the ACL 2025 paper "Quantification of Large Language Model Distillation" ☆94 · Updated 4 months ago
- ☆98 · Updated 4 months ago
- Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint? ☆119 · Updated last year
- [EMNLP 2025] Official implementation of the paper "Agentic-R1: Distilled Dual-Strategy Reasoning" ☆100 · Updated 3 months ago
- Patches for Hugging Face Transformers to save memory ☆32 · Updated 6 months ago
- Parallel Scaling Law for Language Models — Beyond Parameter and Inference Time Scaling ☆463 · Updated 7 months ago
- Tina: Tiny Reasoning Models via LoRA ☆310 · Updated 3 months ago
- Efficient Agent Training for Computer Use ☆134 · Updated 3 months ago
- [NeurIPS 2025] Simple extension on vLLM to speed up reasoning models without training ☆215 · Updated 6 months ago
- Ling-V2 is a MoE LLM provided and open-sourced by InclusionAI ☆248 · Updated 2 months ago