woct0rdho / transformers-qwen3-moe-fused
Fused Qwen3 MoE layer for faster training, compatible with HF Transformers, LoRA, 4-bit quant, Unsloth
☆204 · Updated this week
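The tagline above mentions fusing the Qwen3 MoE experts for faster training. As a rough, hedged illustration of the general idea only (this is not this repository's API; `fused_moe_forward`, `w_gate`, `w_up`, and `w_down` are made-up names for the sketch), the per-expert Python loop in a stock MoE block can be replaced with batched matmuls over stacked expert weights; fused implementations go further with custom kernels and quantization support.

```python
# Hedged sketch of the general "fused MoE" idea: replace the per-expert
# Python loop with batched matmuls over stacked expert weights.
# NOT this repository's API; names and shapes are illustrative only.
import torch
import torch.nn.functional as F

def fused_moe_forward(x, router_logits, w_gate, w_up, w_down, top_k=2):
    """
    x:             [tokens, hidden]
    router_logits: [tokens, num_experts]
    w_gate, w_up:  [num_experts, hidden, intermediate]  (stacked expert weights)
    w_down:        [num_experts, intermediate, hidden]
    """
    probs = F.softmax(router_logits, dim=-1)
    weights, experts = probs.topk(top_k, dim=-1)            # [tokens, top_k]
    weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize over top-k

    out = torch.zeros_like(x)
    for k in range(top_k):
        idx = experts[:, k]                                  # chosen expert per token
        # Gather each token's expert weights and run one batched matmul,
        # instead of looping over experts in Python. Real fused kernels
        # (e.g. grouped GEMM in Triton) avoid the per-token weight gather.
        g = torch.bmm(x.unsqueeze(1), w_gate[idx]).squeeze(1)   # [tokens, intermediate]
        u = torch.bmm(x.unsqueeze(1), w_up[idx]).squeeze(1)
        h = F.silu(g) * u                                        # SwiGLU activation
        out += weights[:, k:k + 1] * torch.bmm(h.unsqueeze(1), w_down[idx]).squeeze(1)
    return out

# Tiny smoke test with random tensors.
T, H, I, E = 8, 16, 32, 4
y = fused_moe_forward(
    torch.randn(T, H), torch.randn(T, E),
    torch.randn(E, H, I), torch.randn(E, H, I), torch.randn(E, I, H),
)
print(y.shape)  # torch.Size([8, 16])
```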
Alternatives and similar repositories for transformers-qwen3-moe-fused
Users interested in transformers-qwen3-moe-fused are comparing it to the libraries listed below.
- A repository aimed at pruning DeepSeek V3, R1 and R1-zero to a usable size · ☆77 · Updated 2 months ago
- ☆299 · Updated 5 months ago
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs · ☆193 · Updated last month
- [ICML 2025] TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation · ☆118 · Updated 5 months ago
- A collection of tricks and tools to speed up transformer models · ☆187 · Updated last week
- ☆90 · Updated 5 months ago
- Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models · ☆138 · Updated last year
- ☆85 · Updated 7 months ago
- Chain of Experts (CoE) enables communication between experts within Mixture-of-Experts (MoE) models · ☆223 · Updated last week
- Patches for Hugging Face Transformers to save memory · ☆31 · Updated 5 months ago
- Lightweight toolkit to train and fine-tune 1.58-bit language models · ☆97 · Updated 5 months ago
- ☆98 · Updated 3 months ago
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… · ☆201 · Updated last month
- ☆60 · Updated 5 months ago
- ☆65 · Updated 7 months ago
- Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint? · ☆115 · Updated last year
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning (COLM 2024) · ☆33 · Updated last year
- Parallel Scaling Law for Language Model — Beyond Parameter and Inference Time Scaling · ☆450 · Updated 5 months ago
- LongRoPE is a novel method that extends the context window of pre-trained LLMs to an impressive 2048k tokens · ☆264 · Updated 2 weeks ago
- A highly capable, lightweight 2.4B LLM pre-trained on only 1T tokens, with all training details released · ☆222 · Updated 3 months ago
- [NeurIPS'25 Oral] Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3) · ☆128 · Updated 2 weeks ago
- Nano repo for RL training of LLMs · ☆66 · Updated last week
- Model compression toolkit engineered for enhanced usability, comprehensiveness, and efficiency · ☆197 · Updated this week
- Deep Reasoning Translation (DRT) Project · ☆236 · Updated 2 months ago
- Minimal GRPO implementation from scratch · ☆99 · Updated 7 months ago
- ☆125 · Updated 6 months ago
- Ling-V2 is a MoE LLM provided and open-sourced by InclusionAI · ☆217 · Updated last month
- [EMNLP 2025] The official implementation for the paper "Agentic-R1: Distilled Dual-Strategy Reasoning" · ☆101 · Updated 2 months ago
- Implementation of the paper "LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens" · ☆152 · Updated last year
- Efficient Agent Training for Computer Use · ☆132 · Updated 2 months ago