hkproj / triton-flash-attention
☆136 Updated 2 months ago
Alternatives and similar repositories for triton-flash-attention:
Users who are interested in triton-flash-attention are comparing it to the libraries listed below.
- Cataloging released Triton kernels. ☆204 Updated 2 months ago
- ☆151 Updated last year
- Collection of kernels written in Triton language ☆114 Updated last month
- ring-attention experiments ☆127 Updated 5 months ago
- Distributed training (multi-node) of a Transformer model ☆59 Updated 11 months ago
- Notes on quantization in neural networks ☆77 Updated last year
- ☆158 Updated last month
- ☆191 Updated this week
- An implementation of the transformer architecture onto an Nvidia CUDA kernel ☆174 Updated last year
- A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code. ☆306 Updated last week
- Applied AI experiments and examples for PyTorch ☆249 Updated this week
- Fast low-bit matmul kernels in Triton ☆267 Updated this week
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆524 Updated last month
- LoRA and DoRA from Scratch Implementations ☆198 Updated last year
- Fastest kernels written from scratch ☆199 Updated 2 weeks ago
- Small scale distributed training of sequential deep learning models, built on Numpy and MPI. ☆126 Updated last year
- Best practices & guides on how to write distributed pytorch training code ☆373 Updated 3 weeks ago
- Triton implementation of GPT/LLAMA ☆16 Updated 6 months ago
- LLM KV cache compression made easy ☆440 Updated this week
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆232 Updated 3 weeks ago
- Efficient LLM Inference over Long Sequences ☆365 Updated last month
- KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA problems ☆234 Updated this week
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ☆232 Updated 2 weeks ago
- LLaMA 2 implemented from scratch in PyTorch ☆307 Updated last year
- Mixed precision training from scratch with Tensors and CUDA ☆21 Updated 10 months ago