tingshua-yts / BetterDL
☆28Updated last year
Related projects:
- ☆90Updated 6 months ago
- Tutorials for writing high-performance GPU operators in AI frameworks.☆118Updated last year
- Code base and slides for ECE408: Applied Parallel Programming on GPU.☆113Updated 3 years ago
- A simple deep learning framework that supports automatic differentiation and GPU acceleration.☆55Updated last year
- [USENIX ATC '24] Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Paral…☆41Updated last month
- Code and notes for six major CUDA parallel computing patterns☆57Updated 4 years ago
- ☆48Updated 2 years ago
- Translates networks from different platforms to an Intermediate Representation (IR)☆44Updated 6 years ago
- A tutorial for CUDA&PyTorch☆110Updated this week
- flash attention tutorial written in python, triton, cuda, cutlass☆159Updated 3 months ago
- ☆100Updated 5 months ago
- ☆77Updated last year
- Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.☆20Updated last week
- ☆45Updated 4 years ago
- ☆36Updated 2 years ago
- A high-performance distributed deep learning system targeting large-scale and automated distributed training. If you have any interests, …☆101Updated 9 months ago
- A self-learning tutorial for CUDA High Performance Programming.☆119Updated 2 months ago
- Machine Learning Compiler Road Map☆40Updated last year
- ☆52Updated this week
- ☆133Updated 2 months ago
- A baseline repository of Auto-Parallelism in Training Neural Networks☆138Updated 2 years ago
- Simple CuDNN wrapper☆29Updated 8 years ago
- Implementation of FlashAttention in PyTorch☆95Updated last year
- ATC23 AE☆42Updated last year
- ☆140Updated 4 months ago
- Transformer related optimization, including BERT, GPT☆58Updated last year
- Simple PyTorch graph capturing.☆13Updated last year
- play gemm with tvm☆81Updated last year
- learning how CUDA works☆150Updated last month
- ☆60Updated last month