fangjunzhou / blas-playground
Playground project for BLAS demo.
☆29 Updated last year
Alternatives and similar repositories for blas-playground
Users who are interested in blas-playground are comparing it to the repositories listed below.
- Ray tracer using no GPU acceleration to see how far we can push the limits ☆21 Updated 10 months ago
- Simple C++ borrow checker ☆68 Updated 2 years ago
- A GLSL compiler targeting SPIR-V MLIR ☆20 Updated 8 months ago
- A lightweight memory allocator for hardware-accelerated machine learning ☆150 Updated 3 months ago
- RDNA3 emulator ☆54 Updated 2 months ago
- AMD’s C++ library for accelerating tensor primitives ☆42 Updated this week
- ☆199 Updated 2 years ago
- ☆46 Updated last week
- Nvidia Instruction Set Specification Generator ☆278 Updated 11 months ago
- Multi-Threaded FP32 Matrix Multiplication on x86 CPUs ☆352 Updated 2 months ago
- [DEPRECATED] Moved to ROCm/rocm-libraries repo ☆119 Updated this week
- A Python Compiler Design Toolkit ☆366 Updated this week
- Super fast FP32 matrix multiplication on RDNA3 ☆66 Updated 2 months ago
- chipStar is a tool for compiling and running HIP/CUDA on SPIR-V via OpenCL or Level Zero APIs. ☆284 Updated this week
- Powerful automatic differentiation in C++ and Python ☆374 Updated 2 weeks ago
- Tenstorrent MLIR compiler ☆141 Updated this week
- Scientific computing with Metal in C++: Matrix multiplication example ☆31 Updated 2 years ago
- C++ compile-time implementation of Rust-like macro_rules ☆92 Updated last year
- C++20 Tensor library ☆27 Updated 2 months ago
- A Toolkit for Programming Parallel Algorithms on Shared-Memory Multicore Machines ☆366 Updated last month
- Reference Implementation for stdBLAS ☆143 Updated last month
- [DEPRECATED] Moved to ROCm/rocm-libraries repo ☆172 Updated this week
- rocWMMA ☆117 Updated this week
- Fork of https://gitlab.mpcdf.mpg.de/mtr/pocketfft to simplify external contributions ☆94 Updated 6 months ago
- Source code for 'Modern Parallel Programming with C++ and Assembly' by Dan Kusswurm ☆64 Updated 3 years ago
- Unofficial description of the CUDA assembly (SASS) instruction sets. ☆104 Updated 3 months ago
- Open source cross-platform compiler for compute-intensive loops used in AI algorithms, from Microsoft Research ☆109 Updated last year
- C implementation of the L-Mul f32/f16 multiplications from the paper: https://arxiv.org/html/2410.00907 ☆28 Updated 8 months ago
- Teaching Vectorization and SIMD using Intel Intrinsics in a Computer Organization and Architecture class ☆15 Updated 4 months ago
- CUDA implementation of a parallel Depth First Search (DFS) algorithm and its comparison with a serial C++ DFS implementation. ☆29 Updated 7 years ago