MyCaffe / NCCL
Windows version of NVIDIA's NCCL ('Nickel') for multi-GPU training - please use https://github.com/NVIDIA/nccl for changes.
☆57 Updated last year
Alternatives and similar repositories for NCCL:
Users who are interested in NCCL are comparing it to the libraries listed below:
- An easy way to run, test, benchmark and tune OpenCL kernel files ☆23 Updated last year
- ONNX Runtime: cross-platform, high performance scoring engine for ML models ☆58 Updated this week
- AMD's graph optimization engine. ☆196 Updated this week
- A nvImageCodec library of GPU- and CPU- accelerated codecs featuring a unified interface ☆90 Updated last month
- Computation using data flow graphs for scalable machine learning ☆67 Updated this week
- cudnn_frontend provides a c++ wrapper for the cudnn backend API and samples on how to use it ☆492 Updated last week
- Fork of https://source.codeaurora.org/quic/hexagon_nn/nnlib ☆57 Updated last year
- Large Language Model Onnx Inference Framework ☆28 Updated 3 weeks ago
- oneCCL Bindings for Pytorch* ☆88 Updated last month
- OneFlow->ONNX ☆42 Updated last year
- THIS REPOSITORY HAS MOVED TO github.com/nvidia/cub, WHICH IS AUTOMATICALLY MIRRORED HERE. ☆83 Updated 11 months ago
- ☆38 Updated 2 years ago
- ☆11 Updated last year
- Common libraries for PPL projects ☆29 Updated 3 months ago
- The Triton backend that allows running GPU-accelerated data pre-processing pipelines implemented in DALI's python API. ☆132 Updated 3 weeks ago
- Development repository for the Triton language and compiler ☆105 Updated this week
- how to design cpu gemm on x86 with avx256, that can beat openblas. ☆67 Updated 5 years ago
- A toolkit for developers to simplify the transformation of nn.Module instances. It's now corresponding to Pytorch.fx. ☆13 Updated last year
- ☆124 Updated last year
- The note of Qualcomm OpenCL SDK ☆30 Updated 6 years ago
- ☆27 Updated last year
- ☆69 Updated last year
- The repository targets the OpenCL gemm function performance optimization. It compares several libraries clBLAS, clBLAST, MIOpenGemm, Inte… ☆16 Updated 5 years ago
- ☆58 Updated 4 years ago
- ☆84 Updated last year
- ☆104 Updated 2 months ago
- a c++/cuda template library for tensor lazy evaluation ☆163 Updated last year
- Stretching GPU performance for GEMMs and tensor contractions. ☆231 Updated this week
- Unified compiler/runtime for interfacing with PyTorch Dynamo. ☆99 Updated this week
- stable diffusion using mnn ☆65 Updated last year