dlsyscourse / lecture5
☆18 · Updated last week
Related projects:
- Tutorials for writing high-performance GPU operators in AI frameworks. (☆118, updated last year)
- Examples and exercises from the book Programming Massively Parallel Processors: A Hands-on Approach, by David B. Kirk and Wen-mei W. Hwu (T… (☆33, updated 3 years ago)
- Code base and slides for ECE408: Applied Parallel Programming on GPU. (☆113, updated 3 years ago)
- A baseline repository of auto-parallelism in training neural networks. (☆138, updated 2 years ago)
- Machine learning compiler road map. (☆40, updated last year)
- CUDA matrix multiplication optimization. (☆118, updated 2 months ago)
- SparseTIR: a sparse tensor compiler for deep learning. (☆129, updated last year)
- Penn CIS 5650 (GPU Programming and Architecture) final project. (☆21, updated 9 months ago)
- Performance of the C++ interfaces of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios. (☆20, updated last week)
- An easy-to-understand TensorOp matmul tutorial. (☆265, updated this week)
- Imperative deep learning framework with customized GPU and CPU backends. (☆28, updated last year)
- Playing with GEMM in TVM. (☆81, updated last year)
- FlashAttention tutorial written in Python, Triton, CUDA, and CUTLASS. (☆159, updated 3 months ago)
- Flash-LLM: enabling cost-effective and highly efficient large generative model inference with unstructured sparsity. (☆166, updated 11 months ago)
- TiledCUDA: a highly efficient kernel template library designed to raise CUDA C's level of abstraction for processing tiles. (☆114, updated last week)
- Magicube: a high-performance library for quantized sparse matrix operations (SpMM and SDDMM) in deep learning on Tensor Cores. (☆79, updated last year)
- Automated parallelization system and infrastructure for multiple ecosystems. (☆70, updated last month)
- A simple, high-performance CUDA GEMM implementation. (☆319, updated 8 months ago)
- Standalone FlashAttention-2 kernel without a libtorch dependency. (☆93, updated last week)
- Solutions to Programming Massively Parallel Processors. (☆29, updated 8 months ago)
- PET: optimizing tensor programs with partially equivalent transformations and automated corrections. (☆112, updated 2 years ago)