ademeure / QuickRunCUDA
☆13Updated 3 weeks ago
Alternatives and similar repositories for QuickRunCUDA
Users that are interested in QuickRunCUDA are comparing it to the libraries listed below
- extensible collectives library in triton☆90Updated 6 months ago
- ☆46Updated 5 months ago
- ☆35Updated this week
- How to ensure correctness and ship LLM-generated kernels in PyTorch☆107Updated this week
- Triton-based Symmetric Memory operators and examples☆48Updated last week
- Automatic differentiation for Triton Kernels☆11Updated 2 months ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.☆100Updated 4 months ago
- ☆65Updated 6 months ago
- DeeperGEMM: crazy optimized version☆72Updated 5 months ago
- A bunch of kernels that might make stuff slower 😉☆62Updated last week
- This repository contains companion software for the Colfax Research paper "Categorical Foundations for CuTe Layouts".☆69Updated last month
- ☆93Updated 11 months ago
- ☆31Updated 3 months ago
- Debug print operator for cudagraph debugging☆14Updated last year
- Framework to reduce autotune overhead to zero for well known deployments.☆84Updated last month
- QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning☆119Updated 3 weeks ago
- Building the Virtuous Cycle for AI-driven LLM Systems☆60Updated this week
- ☆50Updated 5 months ago
- Effective transpose on Hopper GPU☆25Updated last month
- An experimental communicating attention kernel based on DeepEP.☆34Updated 2 months ago
- GitHub mirror of the triton-lang/triton repo.☆92Updated this week
- High-speed GEMV kernels, with up to 2.7x speedup over the PyTorch baseline.☆117Updated last year
- ☆42Updated last month
- ☆82Updated 9 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity☆84Updated last year
- TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels☆164Updated this week
- Benchmark tests supporting the TiledCUDA library.☆17Updated 11 months ago
- AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming☆93Updated this week
- Supplemental materials for the ASPLOS 2025 / EuroSys 2025 Contest on Intra-Operator Parallelism for Distributed Deep Learning☆23Updated 5 months ago
- ☆101Updated 5 months ago