leo811121 / UIUC-CS-483-Parallel-Programming
☆19 · Updated 5 years ago
Alternatives and similar repositories for UIUC-CS-483-Parallel-Programming:
Users interested in UIUC-CS-483-Parallel-Programming are comparing it to the repositories listed below.
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆111 · Updated last week
- ☆67 · Updated this week
- Collection of kernels written in the Triton language ☆120 · Updated 3 weeks ago
- DeeperGEMM: crazy optimized version ☆67 · Updated 3 weeks ago
- Examples and exercises from the book Programming Massively Parallel Processors: A Hands-on Approach by David B. Kirk and Wen-mei W. Hwu (T… ☆67 · Updated 4 years ago (see the tiled shared-memory matmul sketch after this list)
- ☆82 · Updated last month
- ☆55 · Updated 2 weeks ago
- PTX-Tutorial written purely by AIs (OpenAI's Deep Research and Claude 3.7) ☆65 · Updated last month
- ☆166 · Updated last year
- Ring-attention experiments ☆130 · Updated 6 months ago
- ☆103 · Updated 8 months ago
- ☆31 · Updated 3 months ago
- ☆153 · Updated last year
- Fast low-bit matmul kernels in Triton ☆291 · Updated this week
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆165 · Updated last month
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆41 · Updated this week
- Mixed-precision training from scratch with Tensors and CUDA ☆22 · Updated 11 months ago
- Load compute kernels from the Hub ☆115 · Updated this week
- ☆87 · Updated last year
- A minimal implementation of vllm. ☆39 · Updated 8 months ago
- ☆29 · Updated last month
- The simplest yet fast implementation of matrix multiplication in CUDA. ☆34 · Updated 9 months ago (see the naive matmul sketch after this list)
- Cataloging released Triton kernels. ☆217 · Updated 3 months ago
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. ☆59 · Updated 3 months ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆82 · Updated this week
- 📚 FFPA (Split-D): Yet another Faster Flash Attention with O(1) GPU SRAM complexity for large headdim, 1.8x~3x↑🎉 faster than SDPA EA. ☆169 · Updated 2 weeks ago
- A minimal cache manager for PagedAttention, on top of llama3. ☆83 · Updated 8 months ago
- ☆200 · Updated this week
- PyTorch bindings for CUTLASS grouped GEMM. ☆81 · Updated 5 months ago
- ☆157 · Updated 3 months ago
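
For orientation on the simple CUDA matmul entries above, here is a minimal sketch of the naive approach: one thread computes one element of C = A × B over row-major float matrices. It is an illustrative example, not code from any listed repository; the kernel name `matmul_naive` and the device pointers in the launch comment are hypothetical.

```cuda
#include <cuda_runtime.h>

// Naive matmul: one thread computes one element of C = A * B.
// A is M x K, B is K x N, C is M x N, all row-major floats.
// Illustrative sketch only; not taken from any repository listed above.
__global__ void matmul_naive(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}

// Typical launch (host side), assuming dA/dB/dC were set up with
// cudaMalloc/cudaMemcpy:
//   dim3 block(16, 16);
//   dim3 grid((N + 15) / 16, (M + 15) / 16);
//   matmul_naive<<<grid, block>>>(dA, dB, dC, M, N, K);
```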
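The tiled shared-memory variant below is the classic optimization taught in Programming Massively Parallel Processors and built on by several of the kernels listed here: each block stages TILE × TILE tiles of A and B through shared memory, so each global element is read once per tile rather than once per output element. Again a sketch under the same row-major assumptions, not code from the book or from any repo above; `TILE` and `matmul_tiled` are assumed names.

```cuda
#include <cuda_runtime.h>

#define TILE 16  // tile width; launch with blockDim = (TILE, TILE)

// Tiled matmul: each block computes a TILE x TILE tile of C, staging tiles
// of A and B through shared memory. Illustrative sketch, not repo code.
__global__ void matmul_tiled(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        // Cooperative loads with bounds checks for ragged edge tiles.
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] =
            (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();  // tiles fully loaded before use

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // done reading before the next overwrite
    }
    if (row < M && col < N)
        C[row * N + col] = acc;
}
```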