IBM / pytorch-communication-benchmarks
PyTorch code examples for measuring the performance of collective communication calls in AI workloads
☆16 · Updated 5 months ago
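To illustrate the kind of measurement this repository targets, here is a minimal sketch of timing a `torch.distributed` all-reduce. It is not taken from the IBM code; the function name, default sizes, and port number are illustrative.

```python
import time
import torch
import torch.distributed as dist

def benchmark_all_reduce(num_elements: int = 1 << 20, iters: int = 10):
    """Time dist.all_reduce on a float32 tensor.

    Returns (seconds per iteration, bytes in the tensor). Assumes a
    process group has already been initialized.
    """
    tensor = torch.ones(num_elements, dtype=torch.float32)
    dist.all_reduce(tensor)  # warm-up, so one-time setup cost is excluded
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(tensor)
    elapsed = (time.perf_counter() - start) / iters
    return elapsed, num_elements * tensor.element_size()
```

A real multi-GPU benchmark would launch one process per device (e.g. with `torchrun --nproc_per_node=N`) and initialize the group with the `nccl` backend; a single-process `gloo` group is enough to exercise the code path for demonstration.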
Alternatives and similar repositories for pytorch-communication-benchmarks:
Users interested in pytorch-communication-benchmarks are comparing it to the repositories listed below.
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆107 · Updated this week
- ☆65 · Updated last week
- A minimal implementation of vLLM. ☆37 · Updated 8 months ago
- High-speed GEMV kernels delivering up to a 2.7× speedup over the PyTorch baseline. ☆103 · Updated 8 months ago
- ☆193 · Updated 8 months ago
- Intel Gaudi's Megatron-DeepSpeed for training large language models. ☆13 · Updated 3 months ago
- ☆76 · Updated 4 months ago
- An extensible collectives library in Triton. ☆84 · Updated this week
- Memory Optimizations for Deep Learning (ICML 2023). ☆62 · Updated last year
- A Python library that transfers PyTorch tensors between CPU and NVMe. ☆111 · Updated 4 months ago
- High-performance Transformer implementation in C++. ☆113 · Updated 2 months ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆74 · Updated this week
- Applied AI experiments and examples for PyTorch. ☆251 · Updated 2 weeks ago
- Ahead-of-Time (AOT) Triton math library. ☆56 · Updated 2 weeks ago
- MLPerf™ logging library. ☆33 · Updated this week
- Perplexity GPU Kernels. ☆134 · Updated this week
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5). ☆243 · Updated 5 months ago
- ☆54 · Updated 6 months ago
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆38 · Updated this week
- ☆68 · Updated 2 months ago
- A minimal cache manager for PagedAttention, built on top of llama3. ☆79 · Updated 7 months ago
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components. ☆190 · Updated this week
- Collection of kernels written in the Triton language. ☆117 · Updated last month
- nnScaler: Compiling DNN models for parallel training. ☆103 · Updated last month
- Automated parallelization system and infrastructure for multiple ecosystems. ☆79 · Updated 4 months ago
- LLM serving performance evaluation harness. ☆73 · Updated last month
- PyTorch library for cost-effective, fast, and easy serving of MoE models. ☆161 · Updated last week
- ☆26 · Updated this week
- Boosting 4-bit inference kernels with 2:4 sparsity. ☆72 · Updated 7 months ago
- FlexFlow Serve: low-latency, high-performance LLM serving. ☆34 · Updated this week