sonnyli / flash_attention_from_scratch
Flash Attention from Scratch on CUDA Ampere
☆102 · Updated 3 months ago
Alternatives and similar repositories for flash_attention_from_scratch
Users interested in flash_attention_from_scratch are comparing it to the repositories listed below
- Codes & examples for "CUDA - From Correctness to Performance"☆119Updated last year
- A PyTorch-like deep learning framework. Just for fun.☆157Updated 2 years ago
- Summary of the Specs of Commonly Used GPUs for Training and Inference of LLM☆69Updated 4 months ago
- ☆278Updated 2 months ago
- A lightweight design for computation-communication overlap.☆206Updated last week
- From Minimal GEMM to Everything☆87Updated last month
- gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling☆52Updated this week
- NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer☆150Updated 3 months ago
- The repository has collected a batch of noteworthy MLSys bloggers (Algorithms/Systems)☆307Updated 11 months ago
- Solution of Programming Massively Parallel Processors☆49Updated last year
- Summary of some awesome work for optimizing LLM inference☆157Updated last month
- NVIDIA cuTile learn☆137Updated 3 weeks ago
- Examples of CUDA implementations by Cutlass CuTe☆263Updated 5 months ago
- Since the emergence of chatGPT in 2022, the acceleration of Large Language Model has become increasingly important. Here is a list of pap…☆282Updated 9 months ago
- High performance Transformer implementation in C++. ☆146 · Updated 11 months ago
- ☆112 · Updated 7 months ago
- Tile-based language built for AI computation across all scales ☆109 · Updated last week
- Multi-Level Triton Runner supporting Python, IR, PTX, and cubin. ☆80 · Updated last week
- ☆79 · Updated 3 years ago
- Open ABI and FFI for Machine Learning Systems ☆262 · Updated this week
- [NeurIPS 2025] ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective Primitive ☆58 · Updated 3 weeks ago
- ☆92 · Updated 8 months ago
- [EuroSys'25] Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization ☆21 · Updated 4 months ago
- An annotated nano_vllm repository, with MiniCPM4 adaptation completed and support for registering new models ☆126 · Updated 4 months ago
- We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel … ☆191 · Updated 11 months ago
- A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation. ☆121 · Updated last week
- Puzzles for learning Triton; play them with minimal environment configuration! ☆583 · Updated this week
- ☆97 · Updated last month
- DeepSeek-V3/R1 inference performance simulator ☆175 · Updated 9 months ago
- Implement Flash Attention using CuTe. ☆100 · Updated last year