sunkx109 / My-Torch-Extension

A minimalist and extensible PyTorch extension for implementing custom backend operators in PyTorch.

☆34

Alternatives and similar repositories for My-Torch-Extension

Users who are interested in My-Torch-Extension are comparing it to the libraries listed below.

ifromeast / cuda_learning
learning how CUDA works
☆324 Updated 7 months ago
66RING / tiny-flash-attention
flash attention tutorial written in python, triton, cuda, cutlass
☆426 Updated 5 months ago
DD-DuDa / Cute-Learning
Examples of CUDA implementations by Cutlass CuTe
☆241 Updated 3 months ago
AyakaGEMM / Hands-on-GEMM
☆139 Updated last year
harleyszhang / llm_counts
LLM theoretical performance analysis tools, supporting params, FLOPs, memory, and latency analysis.
☆108 Updated 3 months ago
harleyszhang / lite_llama
A light llama-like llm inference framework based on the triton kernel.
☆157 Updated 3 weeks ago
RussWong / CUDATutorial
A CUDA tutorial for learning CUDA programming from scratch
☆252 Updated last year
reed-lau / cute-gemm
☆135 Updated 10 months ago
Cjkkkk / CUDA_gemm
A simple high performance CUDA GEMM implementation.
☆409 Updated last year
KnowingNothing / MatmulTutorial
An Easy-to-understand TensorOp Matmul Tutorial
☆381 Updated this week
FlagOpen / FlagAttention
A collection of memory efficient attention operators implemented in the Triton language.
☆279 Updated last year
xlite-dev / ffpa-attn
🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA.
☆220 Updated 2 months ago
SiriusNEO / Triton-Puzzles-Lite
Puzzles for learning Triton, play it with minimal environment configuration!
☆537 Updated 3 weeks ago
luliyucoordinate / cute-flash-attention
Implement Flash Attention using Cute.
☆96 Updated 9 months ago
nicolaswilde / cuda-sgemm
☆69 Updated 9 months ago
BBuf / how-to-learn-deep-learning-framework
how to learn PyTorch and OneFlow
☆456 Updated last year
Bruce-Lee-LY / cuda_hgemm
Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct…
☆482 Updated last year
openmlsys / openmlsys-cuda
Tutorials for writing high-performance GPU operators in AI frameworks.
☆132 Updated 2 years ago
weishengying / tiny-flash-attention
A simplified flash-attention implementation using cutlass, intended for teaching purposes
☆49 Updated last year
mdy666 / mdy_triton
☆147 Updated 3 months ago
DefTruth / CUDA-Learn-Notes
📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).
☆46 Updated 5 months ago
CalebDu / Awesome-Cute
☆106 Updated 4 months ago
YuxueYang1204 / CudaDemo
Implement custom operators in PyTorch with cuda/c++
☆71 Updated 2 years ago
caiwanxianhust / FasterLLaMA
A llama model inference framework implemented in CUDA C++
☆62 Updated 11 months ago
Bruce-Lee-LY / flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆40 Updated 7 months ago
zjhellofss / triton_course
☆36 Updated 5 months ago
ruikangliu / FlatQuant
[ICML 2025] Official PyTorch implementation of "FlatQuant: Flatness Matters for LLM Quantization"
☆171 Updated 2 weeks ago
weishengying / cutlass_flash_atten_fp8
An fp8 flash attention implementation on the Ada architecture, built with the cutlass repository
☆75 Updated last year
CalvinXKY / BasicCUDA
A tutorial for CUDA&PyTorch
☆155 Updated 8 months ago
RussWong / LLM-engineering
☆25 Updated 2 months ago