dlsyscourse / hw2
☆6 · Updated 5 months ago
Alternatives and similar repositories for hw2:
Users interested in hw2 are comparing it to the repositories listed below.
- A llama model inference framework implemented in CUDA C++ ☆48 · Updated 4 months ago
- ☆16 · Updated last year
- ☆108 · Updated this week
- Triton documentation in Simplified Chinese ☆62 · Updated 2 months ago
- Transformer-related optimization, including BERT and GPT ☆17 · Updated last year
- PaddlePaddle "Escort Program" training camp ☆19 · Updated last week
- A practical way of learning Swizzle ☆16 · Updated 2 months ago
- Machine Learning Compiler Roadmap ☆43 · Updated last year
- A simple MFU (model FLOPs utilization) calculator for LLMs. ☆29 · Updated last month
- Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA using CUDA cores for the decoding stage of LLM inference. ☆35 · Updated 3 weeks ago
- ☆115 · Updated last year
- Multiple GEMM operators constructed with CUTLASS to support LLM inference. ☆17 · Updated 6 months ago
- GPTQ inference TVM kernel ☆38 · Updated 11 months ago
- ☆78 · Updated last year
- Courses on Bilibili ☆72 · Updated last year
- A minimalist and extensible PyTorch extension for implementing custom backend operators in PyTorch. ☆34 · Updated 11 months ago
- LLM theoretical performance analysis tool supporting parameter-count, FLOPs, memory, and latency analysis. ☆81 · Updated 2 months ago
- My solutions to the assignments of CMU 10-714 Deep Learning Systems (2022) ☆36 · Updated last year
- Tutorials for writing high-performance GPU operators in AI frameworks. ☆130 · Updated last year
- A layered, decoupled deep learning inference engine ☆72 · Updated last month
- [USENIX ATC '24] Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Paral… ☆51 · Updated 8 months ago
- Implements Flash Attention using CuTe. ☆74 · Updated 3 months ago
- ☆76 · Updated last week
- Performance of the C++ interfaces of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios. ☆35 · Updated last month
- A cross-chip-platform collection of operators and a unified neural network library. ☆16 · Updated last year
- A lightweight llama-like LLM inference framework based on Triton kernels. ☆103 · Updated 3 weeks ago
- ☆70 · Updated 2 years ago
- Transformer-related optimization, including BERT and GPT ☆59 · Updated last year
- ☆50 · Updated 2 months ago
- ☆65 · Updated 3 months ago