ROCm / pytorch-micro-benchmarking
☆16 Updated last week
Related projects
Alternatives and complementary repositories for pytorch-micro-benchmarking
- RCCL Performance Benchmark Tests ☆50 Updated 3 weeks ago
- ☆15 Updated 2 months ago
- ROCm Communication Collectives Library (RCCL) ☆270 Updated this week
- RDC ☆23 Updated this week
- ☆30 Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆45 Updated this week
- A tool for bandwidth measurements on NVIDIA GPUs. ☆321 Updated last month
- This is a plugin which lets EC2 developers use libfabric as network provider while running NCCL applications. ☆147 Updated this week
- ☆12 Updated 8 months ago
- Intel® Tensor Processing Primitives extension for Pytorch* ☆10 Updated last week
- OpenAI Triton backend for Intel® GPUs ☆143 Updated this week
- NCCL Profiling Kit ☆112 Updated 4 months ago
- HPCG benchmark based on ROCm platform ☆35 Updated 3 weeks ago
- Advanced Profiling and Analytics for AMD Hardware ☆137 Updated this week
- ROC_SHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform. ☆39 Updated last year
- Intel® Extension for DeepSpeed* is an extension to DeepSpeed that brings feature support with SYCL kernels on Intel GPU(XPU) device. Note… ☆57 Updated 2 months ago
- ROC profiler library. Profiling with perf-counters and derived metrics. ☆130 Updated this week
- A system validation and diagnostics tool for monitoring, stress testing, detecting, and troubleshooting issues impacting AMD GPUs in high… ☆66 Updated this week
- Reference implementations of MLPerf™ HPC training benchmarks ☆42 Updated 5 months ago
- ☆59 Updated this week
- Bandwidth test for ROCm ☆49 Updated this week
- ☆17 Updated this week
- ☆30 Updated this week
- PArallelLOOPgEneratoR: Threaded Loops Code Generation Infrastructure targeting Tensor Contraction Applications such as GEMMs, Convolution… ☆18 Updated last month
- Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators ☆313 Updated this week
- A tool for generating information about the matrix multiplication instructions in AMD Radeon™ and AMD Instinct™ accelerators ☆68 Updated 10 months ago
- AITemplate is a Python framework which renders neural network into high performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (N… ☆11 Updated 4 months ago
- hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionalities beyond a traditiona… ☆63 Updated this week
- Tartan: Evaluating Modern GPU Interconnect via a Multi-GPU Benchmark Suite ☆60 Updated 6 years ago
- Development repository for the Triton language and compiler ☆93 Updated this week