hkproj / triton-flash-attention
☆125 · Updated last month
Alternatives and similar repositories for triton-flash-attention:
Users interested in triton-flash-attention are comparing it to the repositories listed below.
- ☆141 · Updated last year
- ring-attention experiments ☆123 · Updated 3 months ago
- A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code. ☆263 · Updated last week
- ☆132 · Updated this week
- ☆175 · Updated this week
- Triton implementation of GPT/LLaMA ☆16 · Updated 5 months ago
- Cataloging released Triton kernels. ☆164 · Updated last month
- Notes on quantization in neural networks ☆68 · Updated last year
- LLM KV cache compression made easy ☆384 · Updated this week
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024 ☆270 · Updated this week
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ☆221 · Updated this week
- Efficient LLM Inference over Long Sequences ☆356 · Updated this week
- LoRA and DoRA from Scratch Implementations ☆195 · Updated 11 months ago
- Distributed training (multi-node) of a Transformer model ☆53 · Updated 10 months ago
- LLaMA 2 implemented from scratch in PyTorch ☆292 · Updated last year
- LoRA: Low-Rank Adaptation of Large Language Models implemented using PyTorch ☆94 · Updated last year
- Mixed precision training from scratch with Tensors and CUDA ☆21 · Updated 9 months ago
- Learnings and programs related to CUDA ☆229 · Updated this week
- ☆165 · Updated 2 months ago
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆513 · Updated this week
- Minimalistic 4D-parallelism distributed training framework for educational purposes ☆717 · Updated this week
- Best practices & guides on how to write distributed PyTorch training code ☆348 · Updated 3 weeks ago
- An implementation of the transformer architecture as an Nvidia CUDA kernel ☆169 · Updated last year
- Code for studying the super weight in LLMs ☆79 · Updated 2 months ago
- Complete implementation of Llama2 with/without KV cache & inference 🚀 ☆47 · Updated 8 months ago
- Small-scale distributed training of sequential deep learning models, built on NumPy and MPI. ☆117 · Updated last year
- Applied AI experiments and examples for PyTorch ☆223 · Updated this week