IBM / pytorch-communication-benchmarks
PyTorch code examples for measuring the performance of collective communication calls in AI workloads
☆18 · Updated 7 months ago
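For context on what such benchmarks report: collective-communication benchmarks typically distinguish *algorithm bandwidth* (bytes moved per rank divided by elapsed time) from *bus bandwidth*, which rescales the figure to reflect hardware link utilization. A minimal sketch of that conversion for all-reduce, using the ring-based `2*(n-1)/n` factor popularized by NCCL's performance tests (the function name and example numbers are illustrative, not from the repository):

```python
def allreduce_bandwidth(bytes_per_rank: int, elapsed_s: float, n_ranks: int):
    """Compute algorithm and bus bandwidth (bytes/s) for an all-reduce.

    Uses the nccl-tests convention: busbw = algbw * 2*(n-1)/n,
    reflecting the data volume a ring all-reduce pushes over each link.
    """
    algbw = bytes_per_rank / elapsed_s             # bytes/s seen by the caller
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks    # per-link hardware utilization
    return algbw, busbw

# Hypothetical example: a 1 GiB all-reduce across 8 ranks taking 10 ms
algbw, busbw = allreduce_bandwidth(1 << 30, 0.010, 8)
print(f"algbw = {algbw / 1e9:.1f} GB/s, busbw = {busbw / 1e9:.1f} GB/s")
```

Bus bandwidth is the more useful number for comparing runs at different rank counts, since it stays constant when the interconnect is the bottleneck.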
Alternatives and similar repositories for pytorch-communication-benchmarks
Users interested in pytorch-communication-benchmarks are comparing it to the libraries listed below.
- A Python library that transfers PyTorch tensors between CPU and NVMe ☆116 · Updated 6 months ago
- Memory Optimizations for Deep Learning (ICML 2023) ☆64 · Updated last year
- ☆72 · Updated 3 months ago
- Fairring (FAIR + Herring) is a plug-in for PyTorch that provides a process group for distributed training that outperforms NCCL at large … ☆65 · Updated 3 years ago
- High-speed GEMV kernels, up to 2.7x speedup over the PyTorch baseline ☆109 · Updated 11 months ago
- ☆38 · Updated this week
- ☆96 · Updated 9 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance ☆167 · Updated this week
- Intel Gaudi's Megatron-DeepSpeed Large Language Models for training ☆13 · Updated 6 months ago
- A minimal implementation of vLLM ☆44 · Updated 10 months ago
- GPTQ inference TVM kernel ☆40 · Updated last year
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing ☆90 · Updated 2 weeks ago
- A Quirky Assortment of CuTe Kernels ☆117 · Updated this week
- A lightweight design for computation-communication overlap ☆143 · Updated this week
- A simple calculation for LLM MFU ☆38 · Updated 3 months ago
- Complete GPU residency for ML ☆17 · Updated last week
- DeeperGEMM: crazy optimized version ☆69 · Updated last month
- ⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak performance⚡️ ☆80 · Updated last month
- [IJCAI2023] An automated parallel training system that combines the advantages of both data and model parallelism. If you have any inte… ☆51 · Updated 2 years ago
- LLM-Inference-Bench ☆45 · Updated 2 weeks ago
- Summary of system papers/frameworks/codes/tools on training or serving large models ☆57 · Updated last year
- Odysseus: Playground of LLM Sequence Parallelism ☆70 · Updated last year
- Ahead-of-Time (AOT) Triton Math Library ☆66 · Updated last week
- Fast and memory-efficient exact attention ☆76 · Updated this week
- Stateful LLM Serving ☆73 · Updated 3 months ago
- ☆75 · Updated 5 months ago
- ☆105 · Updated 10 months ago
- Extensible collectives library in Triton ☆86 · Updated 2 months ago
- MLPerf™ logging library ☆36 · Updated 2 months ago
- Standalone Flash Attention v2 kernel without libtorch dependency ☆110 · Updated 9 months ago
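One entry above is a simple calculator for LLM MFU (Model FLOPs Utilization). The usual estimate divides achieved training FLOPs/s by the hardware's peak; a minimal sketch assuming the common ~6N FLOPs-per-token rule for decoder-only transformers (the function name and example figures are hypothetical, not taken from that repository):

```python
def estimate_mfu(params: float, tokens_per_s: float, peak_flops: float) -> float:
    """Estimate Model FLOPs Utilization for transformer training.

    Assumes the standard approximation of ~6 * params FLOPs per token
    (forward + backward) for a decoder-only model.
    """
    achieved_flops = 6 * params * tokens_per_s   # FLOPs/s actually delivered
    return achieved_flops / peak_flops

# Illustrative example: a 7B-parameter model at 4,000 tokens/s per GPU,
# on an accelerator with an assumed 312 TFLOP/s peak
mfu = estimate_mfu(7e9, 4000, 312e12)
print(f"MFU = {mfu:.1%}")
```

Refinements (e.g. adding attention FLOPs proportional to sequence length) change the numerator, but the ratio's structure stays the same.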