dlsyscourse / hw4Links

☆3

Alternatives and similar repositories for hw4

Users that are interested in hw4 are comparing it to the libraries listed below

Sorting:

dlsyscourse / hw2
☆8Updated 9 months ago
InfiniTensor / RefactorGraph
分层解耦的深度学习推理引擎
☆74Updated 5 months ago
dlsyscourse / hw0
☆38Updated last year
caiwanxianhust / FasterLLaMA
使用 CUDA C++ 实现的 llama 模型推理框架
☆58Updated 8 months ago
openmlsys / openmlsys-cuda
Tutorials for writing high-performance GPU operators in AI frameworks.
☆129Updated last year
dlsyscourse / lecture5
☆20Updated 10 months ago
dlsyscourse / hw1
☆8Updated 10 months ago
tongzhou80 / nanoPyC
☆70Updated 2 years ago
Chtholly-Boss / swizzle
A practical way of learning Swizzle
☆22Updated 5 months ago
YangLinzhuo / cuda-sgemm-optimization
CUDA SGEMM optimization note
☆12Updated last year
lzyrapx / LeetGPU
Solutions of LeetGPU
☆29Updated this week
weishengying / cute_gemm
☆14Updated 11 months ago
zjhellofss / KuiperCourse
b站上的课程
☆75Updated last year
Cjkkkk / KgeN
A TVM-like CUDA/C code generator.
☆9Updated 3 years ago
caijixueIT / CUDA_Learning_for_Freshman
☆11Updated 4 months ago
frankwang0818 / AI_compiler_development_guide
Free resource for the book AI Compiler Development Guide
☆45Updated 2 years ago
tlc-pack / cutlass_fpA_intB_gemm
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
☆93Updated last week
weishengying / tiny-flash-attention
使用 cutlass 实现 flash-attention 精简版，具有教学意义
☆44Updated 11 months ago
AyakaGEMM / Hands-on-GEMM
☆137Updated last year
Ascend / torchair
☆16Updated this week
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆87Updated 2 months ago
Bruce-Lee-LY / cutlass_gemm
Multiple GEMM operators are constructed with cutlass to support LLM inference.
☆18Updated last week
gty111 / GEMM_MMA
Optimize GEMM with tensorcore step by step
☆31Updated last year
LeiWang1999 / AutoGPTQ.tvm
GPTQ inference TVM kernel
☆40Updated last year
l1nkr / DL-Compiler-Navigation
Machine Learning Compiler Road Map
☆43Updated last year
yinuotxie / Efficient-LLM-Inferencing-on-GPUs
Penn CIS 5650 (GPU Programming and Architecture) Final Project
☆35Updated last year
XiaoSong9905 / dgemm-knl
DGEMM on KNL, achieve 75% MKL
☆18Updated 3 years ago
eedalong / ECE408
Code base and slides for ECE408：Applied Parallel Programming On GPU.
☆128Updated 4 years ago
OpenPPL / ppl.kernel.cuda
☆37Updated 9 months ago
tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆111Updated 10 months ago