inducer / loopy
A code generator for array-based code on CPUs and GPUs
☆595 · Updated this week
Alternatives and similar repositories for loopy:
Users who are interested in loopy compare it to the libraries listed below.
- Library for specialized dense and sparse matrix operations, and deep learning primitives. ☆856 · Updated this week
- Common in-memory tensor structure ☆930 · Updated 3 months ago
- The Tensor Algebra Compiler (taco) computes sparse tensor expressions on CPUs and GPUs ☆1,267 · Updated 9 months ago
- Stretching GPU performance for GEMMs and tensor contractions. ☆231 · Updated this week
- Kernel Tuner ☆303 · Updated this week
- DaCe - Data Centric Parallel Programming ☆502 · Updated this week
- The Foundation for All Legate Libraries ☆202 · Updated 3 weeks ago
- CLTune: An automatic OpenCL & CUDA kernel tuner ☆172 · Updated 2 years ago
- Automatic parallelization of Python/NumPy, C, and C++ codes on Linux and MacOSX ☆221 · Updated 4 years ago
- Programmable CUDA/C++ GPU Graph Analytics ☆999 · Updated 5 months ago
- The Legion Parallel Programming System ☆700 · Updated last week
- A single-header C++ library for simplifying the use of CUDA Runtime Compilation (NVRTC). ☆519 · Updated 7 months ago
- CUSP: A C++ Templated Sparse Matrix Library ☆408 · Updated 2 months ago
- CUDA Kernel Benchmarking Library ☆547 · Updated last month
- Python interface for MLIR - the Multi-Level Intermediate Representation ☆235 · Updated last month
- The NVIDIA® Tools Extension SDK (NVTX) is a C-based Application Programming Interface (API) for annotating events, code ranges, and resou… ☆330 · Updated 3 weeks ago
- Open single and half precision gemm implementations ☆373 · Updated last year
- Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm ☆197 · Updated last month
- Source code that accompanies The CUDA Handbook. ☆510 · Updated last month
- Pluto: An automatic polyhedral parallelizer and locality optimizer ☆280 · Updated 8 months ago
- [ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl ☆1,690 · Updated last year
- A suite of benchmarks for CPU and GPU performance of the most popular high-performance libraries for Python ☆315 · Updated 3 months ago
- A simple memory manager for CUDA designed to help Deep Learning frameworks manage memory ☆296 · Updated 6 years ago
- TVM integration into PyTorch ☆453 · Updated 5 years ago
- Examples demonstrating available options to program multiple GPUs in a single node or a cluster ☆587 · Updated 2 months ago
- The Tensor Algebra SuperOptimizer for Deep Learning ☆696 · Updated last year
- Symbolic Expression and Statement Module for new DSLs ☆205 · Updated 4 years ago
- Archived implementation of BLAS using the SYCL open standard. See oneMath for a replacement. ☆262 · Updated this week