inducer / loopy
A code generator for array-based code on CPUs and GPUs
☆598 · Updated this week
Alternatives and similar repositories for loopy:
Users interested in loopy are comparing it to the libraries listed below.
- common in-memory tensor structure ☆942 · Updated 2 weeks ago
- Library for specialized dense and sparse matrix operations, and deep learning primitives. ☆862 · Updated this week
- The Foundation for All Legate Libraries ☆204 · Updated last week
- ☆406 · Updated this week
- The Tensor Algebra Compiler (taco) computes sparse tensor expressions on CPUs and GPUs ☆1,277 · Updated 10 months ago
- Automatic parallelization of Python/NumPy, C, and C++ codes on Linux and MacOSX ☆220 · Updated 4 years ago
- CLTune: An automatic OpenCL & CUDA kernel tuner ☆173 · Updated 2 years ago
- DaCe - Data Centric Parallel Programming ☆508 · Updated this week
- Symbolic Expression and Statement Module for new DSLs ☆205 · Updated 4 years ago
- Stretching GPU performance for GEMMs and tensor contractions. ☆233 · Updated this week
- A suite of benchmarks for CPU and GPU performance of the most popular high-performance libraries for Python ☆315 · Updated 4 months ago
- Kernel Tuner ☆311 · Updated last week
- ☆233 · Updated 2 years ago
- Backward compatible ML compute opset inspired by HLO/MHLO ☆446 · Updated this week
- Python interface for MLIR - the Multi-Level Intermediate Representation ☆240 · Updated 2 months ago
- A single-header C++ library for simplifying the use of CUDA Runtime Compilation (NVRTC). ☆521 · Updated this week
- The NVIDIA® Tools Extension SDK (NVTX) is a C-based Application Programming Interface (API) for annotating events, code ranges, and resou… ☆347 · Updated this week
- Examples demonstrating available options to program multiple GPUs in a single node or a cluster ☆606 · Updated 3 months ago
- CUDA Kernel Benchmarking Library ☆560 · Updated 3 months ago
- RAPIDS Memory Manager ☆534 · Updated this week
- NPBench - A Benchmarking Suite for High-Performance NumPy ☆77 · Updated this week
- [ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl ☆1,719 · Updated last year
- The Tensor Algebra SuperOptimizer for Deep Learning ☆696 · Updated 2 years ago
- Programmable CUDA/C++ GPU Graph Analytics ☆1,006 · Updated 6 months ago
- Representation and Reference Lowering of ONNX Models in MLIR Compiler Infrastructure ☆812 · Updated this week
- Dive into Deep Learning Compiler ☆647 · Updated 2 years ago
- oneAPI Math Library (oneMath) ☆645 · Updated 2 weeks ago
- This is a set of simple programs that can be used to explore the features of a parallel platform. ☆420 · Updated 2 months ago
- RFC document, tooling and other content related to the array API standard ☆226 · Updated this week
- A simple memory manager for CUDA designed to help Deep Learning frameworks manage memory ☆296 · Updated 6 years ago