gpu-mode / triton-tutorialsLinks

☆14

Alternatives and similar repositories for triton-tutorials

Users that are interested in triton-tutorials are comparing it to the libraries listed below

Sorting:

graphcore-research / out-of-the-box-fp8-training
Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8.
☆46Updated last year
UmerHA / triton_util
Make triton easier
☆47Updated last year
lernapparat / torchhacks
Hacks for PyTorch
☆19Updated 2 years ago
alexzhang13 / Triton-Puzzles-Solutions
Personal solutions to the Triton Puzzles
☆19Updated last year
drisspg / transformer_nuggets
A place to store reusable transformer components of my own creation or found on the interwebs
☆59Updated last week
NVlabs / EfficientDL
☆33Updated last month
softmax1 / Flash-Attention-Softmax-N
CUDA and Triton implementations of Flash Attention with SoftmaxN.
☆71Updated last year
feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆72Updated last year
nil0x9 / flash-muon
Flash-Muon: An Efficient Implementation of Muon Optimizer
☆152Updated last month
GindaChen / FlexFlashAttention3
FlexAttention w/ FlashAttention3 Support
☆27Updated 10 months ago
megvii-research / IntLLaMA
IntLLaMA: A fast and light quantization solution for LLaMA
☆18Updated 2 years ago
frankxwang / dpo-prefix-sharing
DPO, but faster 🚀
☆44Updated 8 months ago
facebookresearch / any4
Quantize transformers to any learned arbitrary 4-bit numeric format
☆39Updated 3 weeks ago
IST-DASLab / SparseFinetuning
Repository for Sparse Finetuning of LLMs via modified version of the MosaicML llmfoundry
☆42Updated last year
srush / triton-autodiff
Experiment of using Tangent to autodiff triton
☆80Updated last year
BBuf / flash-rwkv
☆32Updated last year
amirzandieh / HyperAttention
Triton Implementation of HyperAttention Algorithm
☆48Updated last year
gpu-mode / ring-attention
ring-attention experiments
☆146Updated 9 months ago
aniquetahir / JORA
JORA: JAX Tensor-Parallel LoRA Library (ACL 2024)
☆35Updated last year
jundaf2 / INT8-Flash-Attention-FMHA-Quantization
☆158Updated last year
tridao / flash-attention-wheels
☆53Updated last year
shreyansh26 / Attention-Mask-Patterns
Using FlexAttention to compute attention with different masking patterns
☆44Updated 10 months ago
rayleizhu / vllm-ra
[ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts
☆40Updated last year
facebookresearch / Ternary_Binary_Transformer
ACL 2023
☆39Updated 2 years ago
yuzhenmao / IceFormer
Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024).
☆25Updated 3 weeks ago
ahennequ / pytorch-custom-mma
☆29Updated 2 years ago
kshitij12345 / torchnnprofiler
Context Manager to profile the forward and backward times of PyTorch's nn.Module
☆83Updated last year
open-lm-engine / flash-model-architectures
A bunch of kernels that might make stuff slower 😉
☆56Updated last week
cmd2001 / KVTuner
KVTuner: Sensitivity-Aware Layer-wise Mixed Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference
☆17Updated 2 months ago
Dao-AILab / gemm-cublas
☆22Updated 3 months ago