Azure / MS-AMP-Examples

Examples for MS-AMP package.

☆29

Alternatives and similar repositories for MS-AMP-Examples

Users who are interested in MS-AMP-Examples compare it to the libraries listed below.

RulinShao / LightSeq
Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training
☆212 Updated 11 months ago
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆107 Updated 2 months ago
yanring / Megatron-MoE-ModelZoo
Best practices for testing advanced Mixtral, DeepSeek, and Qwen series MoE models using Megatron Core MoE.
☆45 Updated last week
feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆72 Updated last year
fanshiqing / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆134 Updated 3 weeks ago
pytorch-labs / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆199 Updated this week
NVlabs / COAT
[ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training
☆221 Updated last month
stanford-futuredata / stk
☆107 Updated 11 months ago
sail-sg / zero-bubble-pipeline-parallelism
Zero Bubble Pipeline Parallelism
☆415 Updated 3 months ago
alexzhang13 / flashattention2-custom-mask
Triton implementation of FlashAttention2 that adds Custom Masks.
☆128 Updated 11 months ago
ModelTC / awesome-lm-system
Summary of system papers, frameworks, code, and tools for training or serving large models
☆57 Updated last year
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆80 Updated 11 months ago
Dao-AILab / grouped-latent-attention
☆123 Updated 2 months ago
FlagOpen / FlagAttention
A collection of memory efficient attention operators implemented in the Triton language.
☆275 Updated last year
zhuohan123 / terapipe
☆75 Updated 4 years ago
SqueezeBits / QUICK
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
☆118 Updated last year
PipeFusion / PipeFusion
A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters
☆46 Updated last year
thu-pacman / FasterMoE
☆85 Updated 3 years ago
andy-yang-1 / DoubleSparse
16-fold memory access reduction with nearly no loss
☆103 Updated 4 months ago
shawntan / scattermoe
Triton-based implementation of Sparse Mixture of Experts.
☆230 Updated 8 months ago
LiuXiaoxuanPKU / GACT-ICML
☆42 Updated 2 years ago
Guangxuan-Xiao / torch-int
This repository contains integer operators on GPUs for PyTorch.
☆211 Updated last year
hpcaitech / TensorNVMe
A Python library that transfers PyTorch tensors between CPU and NVMe
☆118 Updated 8 months ago
anyscale / llm-continuous-batching-benchmarks
☆120 Updated last year
NVIDIA / Megatron-Energon
Megatron's multi-modal data loader
☆232 Updated last week
Victarry / PP-Schedule-Visualization
Pipeline Parallelism Emulation and Visualization
☆54 Updated last month
jundaf2 / INT8-Flash-Attention-FMHA-Quantization
☆158 Updated last year
pytorch-labs / applied-ai
Applied AI experiments and examples for PyTorch
☆289 Updated 2 months ago
Dao-AILab / quack
A Quirky Assortment of CuTe Kernels
☆388 Updated this week
FasterDecoding / TEAL
☆137 Updated 5 months ago