woct0rdho / transformers-qwen3-moe-fused
Fused Qwen3 MoE layer for faster training, compatible with HF Transformers, LoRA, 4-bit quant, Unsloth
☆191 · Updated 2 weeks ago
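For orientation, a minimal sketch of the kind of setup this fused layer targets: a Qwen3 MoE checkpoint loaded through HF Transformers with 4-bit quantization and LoRA adapters. This is not the repository's verified API; the `patch_qwen3_moe_fused` hook in the comments is hypothetical, while the Transformers, PEFT, and bitsandbytes calls are standard.

```python
# Sketch only: shows where a fused Qwen3 MoE layer would slot into a standard
# HF Transformers + bitsandbytes + PEFT training setup.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Standard 4-bit quantization config (bitsandbytes path in Transformers)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-30B-A3B",            # example Qwen3 MoE checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)

# Hypothetical: swap the stock Qwen3 MoE blocks for the fused implementation.
# The real entry point may differ; consult the repository's README.
# from qwen3_moe_fused import patch_qwen3_moe_fused
# patch_qwen3_moe_fused(model)

# Standard LoRA setup on top of the (fused) MoE model
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```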
Alternatives and similar repositories for transformers-qwen3-moe-fused
Users interested in transformers-qwen3-moe-fused are comparing it to the libraries listed below
- ☆296 · Updated 4 months ago
- ☆89 · Updated 5 months ago
- [ICML 2025] TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation ☆115 · Updated 5 months ago
- ☆84 · Updated 6 months ago
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs ☆191 · Updated 2 weeks ago
- A repository aimed at pruning DeepSeek V3, R1 and R1-zero to a usable size ☆74 · Updated last month
- A collection of tricks and tools to speed up transformer models ☆182 · Updated 2 weeks ago
- ☆58 · Updated 5 months ago
- Repo of ACL 2025 Paper "Quantification of Large Language Model Distillation" ☆93 · Updated 2 months ago
- Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models ☆137 · Updated last year
- Chain of Experts (CoE) enables communication between experts within Mixture-of-Experts (MoE) models ☆220 · Updated last month
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆198 · Updated last week
- Deep Reasoning Translation (DRT) Project ☆233 · Updated last month
- Parallel Scaling Law for Language Models — Beyond Parameter and Inference Time Scaling ☆447 · Updated 5 months ago
- [EMNLP 2025] The official implementation for the paper "Agentic-R1: Distilled Dual-Strategy Reasoning" ☆101 · Updated last month
- Efficient Agent Training for Computer Use ☆130 · Updated last month
- [NeurIPS 2025] Simple extension on vLLM to help you speed up reasoning models without training. ☆197 · Updated 4 months ago
- MiroThinker is a family of open-source agentic models trained for deep research and complex tool-use scenarios. ☆467 · Updated last week
- A highly capable, lightweight 2.4B LLM trained on only 1T tokens of pre-training data, with all details released. ☆219 · Updated 2 months ago
- LongRoPE is a novel method that extends the context window of pre-trained LLMs to an impressive 2048k tokens. ☆260 · Updated last year
- [NeurIPS'25 Oral] Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3) ☆118 · Updated last week
- ☆71 · Updated 4 months ago
- Nano repo for RL training of LLMs ☆66 · Updated 2 weeks ago
- ☆64 · Updated 7 months ago
- A pipeline for LLM knowledge distillation ☆109 · Updated 6 months ago
- Ling-V2 is a MoE LLM provided and open-sourced by InclusionAI. ☆151 · Updated 2 weeks ago
- Patches for Hugging Face Transformers to save memory ☆30 · Updated 4 months ago
- Data Synthesis for Deep Research Based on Semi-Structured Data ☆169 · Updated last week
- FuseAI Project ☆87 · Updated 8 months ago
- Ling is a MoE LLM provided and open-sourced by InclusionAI. ☆226 · Updated 5 months ago