Azure / MS-AMP-Examples
Examples for the MS-AMP package.
☆30 · Updated 2 months ago
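The description above only says this repository holds examples for MS-AMP (Microsoft's automatic mixed-precision library for FP8 training). For orientation, here is a minimal sketch of the core usage pattern such examples build on. The `msamp.initialize` call follows the pattern documented in the MS-AMP README; the toy model, loss, and hyperparameters are illustrative assumptions, not code from this repository, and FP8 execution assumes an FP8-capable GPU (e.g. H100).

```python
# Minimal MS-AMP sketch -- illustrative only, not code from MS-AMP-Examples.
import torch
import msamp  # MS-AMP package from Azure/MS-AMP

# Toy model and optimizer; placeholders chosen for this sketch.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# msamp.initialize wraps the model and optimizer for FP8 training.
# opt_level "O2" additionally keeps optimizer states in low precision.
model, optimizer = msamp.initialize(model, optimizer, opt_level="O2")

x = torch.randn(32, 1024, device="cuda")
loss = model(x).square().mean()  # dummy loss for demonstration
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

MS-AMP exposes opt levels O1/O2/O3 that trade off how aggressively weights and optimizer states are kept in low precision; the training recipes in this examples repo apply the same wrapping to full training runs.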
Alternatives and similar repositories for MS-AMP-Examples
Users interested in MS-AMP-Examples are comparing it to the libraries listed below:
- Best practices for training DeepSeek, Mixtral, Qwen and other MoE models using Megatron Core. ☆108 · Updated this week
- Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training. ☆215 · Updated last year
- PyTorch bindings for CUTLASS grouped GEMM. ☆124 · Updated 4 months ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆151 · Updated this week
- Odysseus: Playground of LLM Sequence Parallelism. ☆77 · Updated last year
- ☆112 · Updated last year
- Boosting 4-bit inference kernels with 2:4 sparsity. ☆83 · Updated last year
- [ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training. ☆240 · Updated 2 months ago
- Utility scripts for PyTorch (e.g. a memory profiler that understands low-level allocations such as NCCL). ☆56 · Updated last month
- Summary of system papers/frameworks/codes/tools on training or serving large models. ☆57 · Updated last year
- Training library for Megatron-based models. ☆116 · Updated this week
- ☆87 · Updated 3 years ago
- Zero Bubble Pipeline Parallelism. ☆429 · Updated 5 months ago
- 16-fold memory access reduction with nearly no loss. ☆105 · Updated 6 months ago
- ☆121 · Updated last year
- A collection of memory-efficient attention operators implemented in the Triton language. ☆279 · Updated last year
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM. ☆168 · Updated last year
- Pipeline Parallelism Emulation and Visualization. ☆67 · Updated 4 months ago
- ☆75 · Updated 4 years ago
- ☆158 · Updated 2 years ago
- ☆129 · Updated 4 months ago
- Triton-based implementation of Sparse Mixture of Experts. ☆244 · Updated last week
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference. ☆118 · Updated last year
- Megatron's multi-modal data loader. ☆249 · Updated last week
- ☆143 · Updated 7 months ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5). ☆265 · Updated 2 months ago
- This repository contains integer operators on GPUs for PyTorch. ☆218 · Updated 2 years ago
- ☆78 · Updated 5 months ago
- Triton implementation of FlashAttention2 that adds custom masks. ☆138 · Updated last year
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving. ☆320 · Updated last year