microsoft / Tutel

Tutel MoE: Optimized Mixture-of-Experts Library, Supports DeepSeek/Kimi-K2/Qwen3 FP8/FP4

☆870

Alternatives and similar repositories for Tutel

Users who are interested in Tutel are comparing it to the libraries listed below.

Azure / MS-AMP
Microsoft Automatic Mixed Precision Library
☆616 Updated 10 months ago
google-research / vmoe
☆656 Updated last month
haoliuhl / ringattention
Large Context Attention
☆719 Updated 6 months ago
zhuzilin / ring-flash-attention
Ring attention implementation with flash attention
☆828 Updated last week
laekov / fastmoe
A fast MoE implementation for PyTorch
☆1,769 Updated 5 months ago
pytorch / PiPPy
Pipeline Parallelism for PyTorch
☆775 Updated 11 months ago
codecaution / Awesome-Mixture-of-Experts-Papers
A curated reading list of research in Mixture-of-Experts (MoE).
☆639 Updated 9 months ago
lucidrains / mixture-of-experts
A Pytorch implementation of Sparsely-Gated Mixture of Experts, for massively increasing the parameter count of language models
☆788 Updated last year
sail-sg / zero-bubble-pipeline-parallelism
Zero Bubble Pipeline Parallelism
☆411 Updated 2 months ago
mit-han-lab / smoothquant
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
☆1,461 Updated last year
lucidrains / ring-attention-pytorch
Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch
☆532 Updated 2 months ago
feifeibear / long-context-attention
USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference
☆537 Updated 2 weeks ago
hao-ai-lab / LookaheadDecoding
[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
☆1,263 Updated 4 months ago
facebookresearch / LLM-QAT
Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models"
☆305 Updated 5 months ago
FMInference / H2O
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.
☆462 Updated last year
volcengine / veScale
A PyTorch Native LLM Training Framework
☆839 Updated 3 weeks ago
pytorch-labs / attention-gym
Helpful tools and examples for working with flex-attention
☆904 Updated 2 weeks ago
IST-DASLab / sparsegpt
Code for the ICML 2023 paper "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot".
☆820 Updated 11 months ago
FMInference / DejaVu
☆331 Updated last year
thuml / depyf
depyf is a tool to help you understand and adapt to the PyTorch compiler torch.compile.
☆708 Updated 3 months ago
RulinShao / LightSeq
Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training
☆212 Updated 11 months ago
cli99 / llm-analysis
Latency and Memory Analysis of Transformer Models for Training and Inference
☆441 Updated 3 months ago
NVIDIA / TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Bla…
☆2,587 Updated last week
pytorch / kineto
A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters.
☆842 Updated last week
locuslab / wanda
A simple and effective LLM pruning approach.
☆781 Updated 11 months ago
mit-han-lab / omniserve
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆730 Updated 4 months ago
lucidrains / st-moe-pytorch
Implementation of ST-MoE, the latest incarnation of MoE after years of research at Brain, in Pytorch
☆352 Updated last year
feifeibear / LLMSpeculativeSampling
Fast inference from large language models via speculative decoding
☆791 Updated 11 months ago
SafeAILab / EAGLE
Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3.
☆1,439 Updated last week
deepspeedai / Megatron-DeepSpeed
Ongoing research training transformer language models at scale, including: BERT & GPT-2
☆2,126 Updated 2 weeks ago