IBM / pytorch-communication-benchmarks

PyTorch code examples for measuring the performance of collective communication calls in AI workloads

☆18

Alternatives and similar repositories for pytorch-communication-benchmarks

Users who are interested in pytorch-communication-benchmarks are comparing it to the libraries listed below.

hpcaitech / TensorNVMe
A Python library that transfers PyTorch tensors between CPU and NVMe
☆116 Updated 6 months ago
facebookresearch / MODel_opt
Memory Optimizations for Deep Learning (ICML 2023)
☆64 Updated last year
deepspeedai / DeepSpeed-Kernels
☆72 Updated 3 months ago
facebookresearch / fairring
Fairring (FAIR + Herring) is a plug-in for PyTorch that provides a process group for distributed training that outperforms NCCL at large …
☆65 Updated 3 years ago
wangsiping97 / FastGEMV
High-speed GEMV kernels, with up to 2.7x speedup over the PyTorch baseline.
☆109 Updated 11 months ago
ROCm / TransformerEngine
☆38 Updated this week
AlibabaPAI / FLASHNN
☆96 Updated 9 months ago
pytorch-labs / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆167 Updated this week
HabanaAI / Megatron-DeepSpeed
Intel Gaudi's Megatron DeepSpeed Large Language Models for training
☆13 Updated 6 months ago
MDK8888 / vllmini
A minimal implementation of vLLM.
☆44 Updated 10 months ago
LeiWang1999 / AutoGPTQ.tvm
GPTQ inference TVM kernel
☆40 Updated last year
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆90 Updated 2 weeks ago
Dao-AILab / quack
A Quirky Assortment of CuTe Kernels
☆117 Updated this week
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆143 Updated this week
CalvinXKY / mfu_calculation
A simple calculation for LLM MFU.
☆38 Updated 3 months ago
osayamenja / Kleos
Complete GPU residency for ML.
☆17 Updated last week
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆69 Updated last month
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆80 Updated last month
Youhe-Jiang / IJCAI2023-OptimalShardedDataParallel
[IJCAI2023] An automated parallel training system that combines the advantages from both data and model parallelism. If you have any inte…
☆51 Updated 2 years ago
argonne-lcf / LLM-Inference-Bench
LLM-Inference-Bench
☆45 Updated 2 weeks ago
ModelTC / awesome-lm-system
Summary of system papers, frameworks, code, and tools for training or serving large models
☆57 Updated last year
feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆70 Updated last year
ROCm / aotriton
Ahead of Time (AOT) Triton Math Library
☆66 Updated last week
vllm-project / flash-attention
Fast and memory-efficient exact attention
☆76 Updated this week
WukLab / preble
Stateful LLM Serving
☆73 Updated 3 months ago
INT-FlashAttention2024 / INT-FlashAttention
☆75 Updated 5 months ago
stanford-futuredata / stk
☆105 Updated 10 months ago
cchan / tccl
Extensible collectives library in Triton
☆86 Updated 2 months ago
mlcommons / logging
MLPerf™ logging library
☆36 Updated 2 months ago
tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆110 Updated 9 months ago