ailzhang / EfficientPyTorch

Code release for the book "Efficient Training in PyTorch"

☆110

Alternatives and similar repositories for EfficientPyTorch

Users who are interested in EfficientPyTorch are comparing it to the repositories listed below.

SiriusNEO / Triton-Puzzles-Lite
Puzzles for learning Triton, play it with minimal environment configuration!
☆560 Updated 2 months ago
xlite-dev / ffpa-attn
🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA.
☆229 Updated 3 months ago
interestingLSY / CUDA-From-Correctness-To-Performance-Code
Codes & examples for "CUDA - From Correctness to Performance"
☆117 Updated last year
pprp / ultrascale-playbook-zh
UltraScale Playbook, Chinese edition
☆89 Updated 8 months ago
66RING / tiny-flash-attention
Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS
☆448 Updated 6 months ago
harleyszhang / llm_counts
LLM theoretical performance analysis tools, supporting params, FLOPs, memory, and latency analysis.
☆112 Updated 4 months ago
InternLM / turbomind
☆97 Updated 7 months ago
sunkx109 / GPUs-Specs
Summary of the Specs of Commonly Used GPUs for Training and Inference of LLMs
☆64 Updated 3 months ago
DD-DuDa / Cute-Learning
Examples of CUDA implementations by Cutlass CuTe
☆249 Updated 4 months ago
fzyzcjy / torch_memory_saver
Allow torch tensor memory to be released and resumed later
☆167 Updated last week
Starmys / TritonStudyGroup
☆95 Updated 2 months ago
mit-han-lab / parallel-computing-tutorial
☆176 Updated 2 years ago
Victarry / PP-Schedule-Visualization
Pipeline Parallelism Emulation and Visualization
☆70 Updated 5 months ago
LDLINGLINGLING / nano_vllm_note
An annotated nano_vllm repository, with MiniCPM4 adaptation and support for registering new models
☆95 Updated 3 months ago
luliyucoordinate / cute-flash-attention
Implement Flash Attention using Cute.
☆96 Updated 11 months ago
ifromeast / cuda_learning
learning how CUDA works
☆338 Updated 8 months ago
madsys-dev / deepseekv2-profile
☆151 Updated 8 months ago
flagos-ai / FlagAttention
A collection of memory efficient attention operators implemented in the Triton language.
☆284 Updated last year
MLSys-Learner-Resources / Awesome-MLSys-Blogger
The repository has collected a batch of noteworthy MLSys bloggers (Algorithms/Systems)
☆300 Updated 10 months ago
kwai / Megatron-Kwai
[USENIX ATC '24] Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Paral…
☆66 Updated last year
sgl-project / SpecForge
Train speculative decoding models effortlessly and port them smoothly to SGLang serving.
☆483 Updated this week
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆127 Updated 6 months ago
InternLM / Awesome-LLM-Training-System
☆44 Updated last year
sgl-project / sgl-learning-materials
Materials for learning SGLang
☆644 Updated last week
mdy666 / mdy_triton
☆148 Updated 4 months ago
feifeibear / LLMRoofline
Compare different hardware platforms via the Roofline Model for LLM inference tasks.
☆119 Updated last year
galeselee / Awesome_LLM_System-PaperList
Since the emergence of chatGPT in 2022, the acceleration of Large Language Model has become increasingly important. Here is a list of pap…
☆279 Updated 8 months ago
openmlsys / openmlsys-cuda
Tutorials for writing high-performance GPU operators in AI frameworks.
☆134 Updated 2 years ago
LLMServe / SwiftTransformer
High performance Transformer implementation in C++.
☆142 Updated 10 months ago
interestingLSY / swiftLLM
A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of …
☆288 Updated 5 months ago