microsoft / cusync
☆20Updated 9 months ago
Related projects
Alternatives and complementary repositories for cusync
- A memory profiler for NVIDIA GPUs to explore memory inefficiencies in GPU-accelerated applications.☆22Updated last month
- TiledKernel is a code generation library based on macro kernels and a memory hierarchy graph data structure.☆19Updated 6 months ago
- Triton to TVM transpiler.☆16Updated last month
- A source-to-source compiler for optimizing CUDA dynamic parallelism by aggregating launches☆14Updated 5 years ago
- GPU Performance Advisor☆63Updated 2 years ago
- HeteroSync is a benchmark suite for performing fine-grained synchronization on tightly coupled GPUs☆27Updated 2 months ago
- ThrillerFlow is a Dataflow Analysis and Codegen Framework written in Rust.☆12Updated this week
- ☆31Updated last year
- ☆29Updated 2 years ago
- HeteroCL-MLIR dialect for accelerator design☆40Updated 2 months ago
- ☆40Updated 3 years ago
- PTX-EMU is a simple emulator for CUDA programs.☆24Updated 10 months ago
- A GPU FP32 computation method with Tensor Cores.☆18Updated 2 years ago
- ☆32Updated 2 years ago
- ☆80Updated 7 months ago
- Optimize tensor programs fast with Felix, a gradient descent autotuner.☆19Updated 6 months ago
- PyTorch compilation tutorial covering TorchScript, torch.fx, and Slapo☆19Updated last year
- A lightweight, Pythonic frontend for MLIR☆80Updated last year
- ☆38Updated 4 years ago
- ASPLOS'24: Optimal Kernel Orchestration for Tensor Programs with Korch☆29Updated 3 months ago
- TileFlow is a performance analysis tool based on Timeloop for fusion dataflows☆55Updated 7 months ago
- Microsoft Collective Communication Library☆54Updated this week
- An extension library of WMMA API (Tensor Core API)☆84Updated 4 months ago
- ☆10Updated 2 years ago
- An Attention Superoptimizer☆20Updated 6 months ago
- Implement Flash Attention using Cute.☆39Updated this week
- Implementation of TSM2L and TSM2R -- High-Performance Tall-and-Skinny Matrix-Matrix Multiplication Algorithms for CUDA☆31Updated 4 years ago
- Magicube is a high-performance library for quantized sparse matrix operations (SpMM and SDDMM) of deep learning on Tensor Cores.☆81Updated 2 years ago
- A translator from NVPTX to SPIR-V, modified from the LLVM-SPIR-V Translator.☆33Updated 3 years ago