ROCm / apex
A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch
☆21 · Updated this week
Alternatives and similar repositories for apex:
Users who are interested in apex are comparing it to the libraries listed below:
- ☆26 · Updated this week
- oneCCL Bindings for Pytorch* ☆91 · Updated this week
- RCCL Performance Benchmark Tests ☆60 · Updated 3 weeks ago
- Ahead of Time (AOT) Triton Math Library ☆56 · Updated 2 weeks ago
- OpenAI Triton backend for Intel® GPUs ☆172 · Updated this week
- Bandwidth test for ROCm ☆54 · Updated 3 weeks ago
- ☆20 · Updated last week
- ☆21 · Updated last month
- Benchmark code for the "Online normalizer calculation for softmax" paper ☆87 · Updated 6 years ago
- ☆49 · Updated last year
- Sample examples of how to call collective operation functions on multi-GPU environments. A simple example of using broadcast, reduce, all… ☆32 · Updated last year
- Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5). ☆243 · Updated 5 months ago
- Intel® Extension for DeepSpeed* is an extension to DeepSpeed that brings feature support with SYCL kernels on Intel GPU(XPU) device. Note… ☆62 · Updated 3 weeks ago
- An extension library of WMMA API (Tensor Core API) ☆93 · Updated 8 months ago
- Development repository for the Triton language and compiler ☆114 · Updated this week
- CUDA Templates for Linear Algebra Subroutines ☆16 · Updated this week
- ☆60 · Updated 3 months ago
- Benchmarks to capture important workloads. ☆30 · Updated 2 months ago
- ROCm Tracer Callback/Activity Library for Performance tracing AMD GPUs ☆79 · Updated last week
- Experimental projects related to TensorRT ☆95 · Updated this week
- Provides the examples to write and build Habana custom kernels using the HabanaTools ☆21 · Updated 4 months ago
- ☆193 · Updated 8 months ago
- ☆63 · Updated last week
- oneAPI Collective Communications Library (oneCCL) ☆227 · Updated this week
- MatMul Performance Benchmarks for a Single CPU Core comparing both hand engineered and codegen kernels. ☆129 · Updated last year
- rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform. ☆66 · Updated this week
- A Python library that transfers PyTorch tensors between CPU and NVMe ☆111 · Updated 4 months ago
- RDC ☆27 · Updated this week
- Fast and memory-efficient exact attention ☆163 · Updated this week
- ☆91 · Updated 6 months ago