Azure / MS-AMP

Microsoft Automatic Mixed Precision Library

☆616

Alternatives and similar repositories for MS-AMP

Users who are interested in MS-AMP are comparing it to the libraries listed below.

zhuzilin / ring-flash-attention
Ring attention implementation with flash attention
☆828 · Updated last week
sail-sg / zero-bubble-pipeline-parallelism
Zero Bubble Pipeline Parallelism
☆411 · Updated 2 months ago
haoliuhl / ringattention
Large Context Attention
☆719 · Updated 6 months ago
microsoft / Tutel
Tutel MoE: Optimized Mixture-of-Experts Library, Support DeepSeek/Kimi-K2/Qwen3 FP8/FP4
☆870 · Updated last week
mit-han-lab / omniserve
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆730 · Updated 4 months ago
feifeibear / long-context-attention
USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference
☆537 · Updated 2 weeks ago
IST-DASLab / marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
☆868 · Updated 10 months ago
facebookresearch / LLM-QAT
Code repo for the paper "LLM-QAT Data-Free Quantization Aware Training for Large Language Models"
☆305 · Updated 4 months ago
FMInference / DejaVu
☆331 · Updated last year
pytorch / PiPPy
Pipeline Parallelism for PyTorch
☆775 · Updated 11 months ago
RulinShao / LightSeq
Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training
☆212 · Updated 11 months ago
hao-ai-lab / LookaheadDecoding
[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
☆1,262 · Updated 4 months ago
SqueezeAILab / KVQuant
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
☆362 · Updated 11 months ago
FlagOpen / FlagAttention
A collection of memory efficient attention operators implemented in the Triton language.
☆273 · Updated last year
spcl / QuaRot
Code for Neurips24 paper: QuaRot, an end-to-end 4-bit inference of large language models.
☆410 · Updated 8 months ago
FMInference / H2O
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.
☆462 · Updated last year
microsoft / BitBLAS
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
☆653 · Updated 3 weeks ago
efeslab / Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆318 · Updated last year
fpgaminer / GPTQ-triton
GPTQ inference Triton kernel
☆303 · Updated 2 years ago
cli99 / llm-analysis
Latency and Memory Analysis of Transformer Models for Training and Inference
☆441 · Updated 3 months ago
efeslab / Nanoflow
A throughput-oriented high-performance serving framework for LLMs
☆856 · Updated 3 weeks ago
pytorch-labs / applied-ai
Applied AI experiments and examples for PyTorch
☆289 · Updated 2 months ago
mit-han-lab / smoothquant
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
☆1,461 · Updated last year
usyd-fsalab / fp6_llm
An efficient GPU support for LLM inference with x-bit quantization (e.g. FP6,FP5).
☆260 · Updated 2 weeks ago
Guangxuan-Xiao / torch-int
This repository contains integer operators on GPUs for PyTorch.
☆208 · Updated last year
neuralmagic / AutoFP8
☆195 · Updated 2 months ago
SqueezeAILab / SqueezeLLM
[ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization
☆697 · Updated 11 months ago
nbasyl / LLM-FP4
The official implementation of the EMNLP 2023 paper LLM-FP4
☆211 · Updated last year
lucidrains / ring-attention-pytorch
Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch
☆532 · Updated 2 months ago
fanshiqing / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆134 · Updated 2 weeks ago