IBM / pytorch-communication-benchmarks
PyTorch code examples for measuring the performance of collective communication calls in AI workloads.
☆16 · Updated 5 months ago
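The measurement methodology such benchmarks rely on can be illustrated with a generic timing loop: warm up the operation, then time many iterations and aggregate the latencies. The sketch below is a minimal, hypothetical harness (the `benchmark` function and the stand-in workload are illustrative, not from the repository); a real run would replace the dummy callable with a collective such as `torch.distributed.all_reduce`, synchronized across ranks.

```python
import time
import statistics

def benchmark(fn, warmup=5, iters=50):
    """Time a callable and report latency statistics in milliseconds.

    A generic sketch of a benchmark loop; in a real collective benchmark
    `fn` would issue a communication call (e.g. an all-reduce) and the
    loop would synchronize devices before reading the clock.
    """
    for _ in range(warmup):  # warmup iterations exclude one-time setup costs
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)  # seconds -> ms
    return {
        "mean_ms": statistics.mean(samples),
        "median_ms": statistics.median(samples),
    }

# Stand-in workload: a pure-Python reduction in place of a real collective.
stats = benchmark(lambda: sum(range(10_000)))
print(stats)
```

Warmup matters because the first calls often pay one-time costs (allocator growth, communicator setup) that would otherwise skew the reported latency.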
Alternatives and similar repositories for pytorch-communication-benchmarks:
Users interested in pytorch-communication-benchmarks compare it to the libraries listed below.
- Tritonbench: a collection of PyTorch custom operators with example inputs to measure their performance. ☆116 · Updated this week
- A minimal implementation of vLLM. ☆39 · Updated 9 months ago
- An extensible collectives library in Triton. ☆85 · Updated 3 weeks ago
- A bunch of kernels that might make stuff slower 😉 ☆34 · Updated this week
- A minimal cache manager for PagedAttention, built on top of llama3. ☆83 · Updated 8 months ago
- High-speed GEMV kernels, with up to 2.7x speedup over the PyTorch baseline. ☆106 · Updated 9 months ago
- Ahead-of-Time (AOT) Triton math library. ☆57 · Updated last week
- Automated parallelization system and infrastructure for multiple ecosystems. ☆78 · Updated 5 months ago
- A low-latency, high-throughput serving engine for LLMs. ☆346 · Updated last week
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆41 · Updated this week
- Intel Gaudi's Megatron-DeepSpeed for training large language models. ☆13 · Updated 4 months ago
- Custom kernels in the Triton language for accelerating LLMs. ☆18 · Updated last year
- Applied AI experiments and examples for PyTorch. ☆262 · Updated last month
- A collection of kernels written in the Triton language. ☆120 · Updated 3 weeks ago
- TileFusion: an experimental C++ macro kernel template library that raises the abstraction level in CUDA C for tile processing. ☆82 · Updated last week
- PyTorch bindings for CUTLASS grouped GEMM. ☆81 · Updated 5 months ago
- ⚡️ Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak performance. ⚡️ ☆73 · Updated 3 weeks ago
- MLPerf™ logging library. ☆34 · Updated last week
- A safetensors extension for efficiently storing sparse quantized tensors on disk. ☆102 · Updated this week
- DeeperGEMM: a heavily optimized version. ☆67 · Updated 3 weeks ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5). ☆248 · Updated 5 months ago
- Memory Optimizations for Deep Learning (ICML 2023). ☆64 · Updated last year
- A Python library that transfers PyTorch tensors between CPU and NVMe. ☆113 · Updated 5 months ago