ailzhang / EfficientPyTorch
Code release for book "Efficient Training in PyTorch"
☆121Updated 9 months ago
Alternatives and similar repositories for EfficientPyTorch
Users that are interested in EfficientPyTorch are comparing it to the libraries listed below
- Theoretical performance analysis tool for LLMs, supporting parameter, FLOPs, memory, and latency analysis.☆114Updated 6 months ago
- Puzzles for learning Triton; play with minimal environment configuration!☆602Updated last month
- 🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA.☆246Updated last week
- ☆178Updated 2 years ago
- High Performance LLM Inference Operator Library☆222Updated last week
- Examples of CUDA implementations by Cutlass CuTe☆269Updated 6 months ago
- Flash Attention tutorial written in Python, Triton, CUDA, and CUTLASS☆479Updated last week
- Codes & examples for "CUDA - From Correctness to Performance"☆120Updated last year
- Flash Attention from Scratch on CUDA Ampere☆121Updated 4 months ago
- UltraScale Playbook, Chinese edition☆125Updated 10 months ago
- An annotated nano_vllm repository, with MiniCPM4 adaptation and support for registering new models☆147Updated 5 months ago
- Summary of the Specs of Commonly Used GPUs for Training and Inference of LLM☆73Updated 5 months ago
- A collection of noteworthy MLSys bloggers (Algorithms/Systems)☆317Updated last year
- ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.☆145Updated 8 months ago
- Implement Flash Attention using Cute.☆100Updated last year
- ☆155Updated 10 months ago
- A collection of memory efficient attention operators implemented in the Triton language.☆287Updated last year
- ☆116Updated 4 months ago
- ☆219Updated last year
- ☆112Updated 8 months ago
- LLM training technologies developed by kwai☆70Updated last week
- ☆105Updated last year
- Triton adapter for Ascend. Mirror of https://gitee.com/ascend/triton-ascend☆102Updated last week
- A tutorial for CUDA&PyTorch☆208Updated last week
- Pipeline Parallelism Emulation and Visualization☆76Updated 3 weeks ago
- ☆144Updated last year
- A llama model inference framework implemented in CUDA C++☆64Updated last year
- ☆152Updated 6 months ago
- NVIDIA cuTile learn☆150Updated last month
- Since the emergence of ChatGPT in 2022, the acceleration of Large Language Models has become increasingly important. Here is a list of pap…☆283Updated 10 months ago