hkproj / triton-flash-attention
☆125 · Updated last month
Alternatives and similar repositories for triton-flash-attention:
Users interested in triton-flash-attention are comparing it to the repositories listed below.
- ☆141 · Updated last year
- ring-attention experiments ☆123 · Updated 3 months ago
- A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code. ☆263 · Updated last week
- ☆132 · Updated this week
- ☆175 · Updated this week
- Triton implementation of GPT/LLaMA ☆16 · Updated 5 months ago
- Cataloging released Triton kernels. ☆164 · Updated last month
- Notes on quantization in neural networks ☆68 · Updated last year
- LLM KV cache compression made easy ☆384 · Updated this week
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024 ☆270 · Updated this week
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ☆221 · Updated this week
- Efficient LLM Inference over Long Sequences ☆356 · Updated this week
- LoRA and DoRA from Scratch Implementations ☆195 · Updated 11 months ago
- Distributed training (multi-node) of a Transformer model ☆53 · Updated 10 months ago
- LLaMA 2 implemented from scratch in PyTorch ☆292 · Updated last year
- LoRA: Low-Rank Adaptation of Large Language Models implemented using PyTorch ☆94 · Updated last year
- Mixed precision training from scratch with Tensors and CUDA ☆21 · Updated 9 months ago
- Learnings and programs related to CUDA ☆229 · Updated this week
- ☆165 · Updated 2 months ago
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆513 · Updated this week
- Minimalistic 4D-parallelism distributed training framework for educational purposes ☆717 · Updated this week
- Best practices & guides on how to write distributed PyTorch training code ☆348 · Updated 3 weeks ago
- An implementation of the transformer architecture as an Nvidia CUDA kernel ☆169 · Updated last year
- Code for studying the super weight in LLMs ☆79 · Updated 2 months ago
- Complete implementation of Llama2 with/without KV cache & inference 🚀 ☆47 · Updated 8 months ago
- Small-scale distributed training of sequential deep learning models, built on NumPy and MPI. ☆117 · Updated last year
- Applied AI experiments and examples for PyTorch ☆223 · Updated this week