fla-org / flash-linear-attention
🚀 Efficient implementations of state-of-the-art linear attention models in PyTorch and Triton
☆1,669 · Updated this week
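For context on what "linear attention" refers to here, below is a minimal, non-causal sketch in plain PyTorch (in the style of Katharopoulos et al., 2020). It only illustrates the idea the library accelerates and is not the flash-linear-attention API: the feature map, shapes, and function name are assumptions, while the library itself ships fused Triton kernels for many model variants.

```python
# Minimal sketch of (non-causal) linear attention in plain PyTorch.
# Illustrative only -- not the flash-linear-attention API. The core idea:
# replace softmax(QK^T)V with a positive feature map phi so the key/value
# reduction is computed first, making cost linear in sequence length.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (batch, heads, seq_len, head_dim)
    q = F.elu(q) + 1  # phi(x) = elu(x) + 1, as in Katharopoulos et al. (2020)
    k = F.elu(k) + 1
    kv = torch.einsum("bhnd,bhne->bhde", k, v)  # reduce over sequence first: O(N * d^2)
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)  # normalizer
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)  # (batch, heads, seq_len, head_dim)

q = k = v = torch.randn(1, 4, 128, 64)
out = linear_attention(q, k, v)  # same shape as v
```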
Alternatives and similar repositories for flash-linear-attention:
Users interested in flash-linear-attention are comparing it to the libraries listed below.
- Puzzles for learning Triton ☆1,300 · Updated last month
- Helpful tools and examples for working with flex-attention ☆583 · Updated this week
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆505 · Updated 2 months ago
- Official PyTorch implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States ☆1,092 · Updated 6 months ago
- Official Implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters ☆477 · Updated this week
- Large Context Attention ☆670 · Updated 5 months ago
- Tile primitives for speedy kernels ☆1,923 · Updated this week
- Official implementation of "Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling" ☆831 · Updated last month
- Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in PyTorch ☆492 · Updated 2 months ago
- Annotated version of the Mamba paper ☆469 · Updated 10 months ago
- Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA ☆714 · Updated this week
- A bibliography and survey of the papers surrounding o1 ☆1,042 · Updated 2 months ago
- FlashInfer: Kernel Library for LLM Serving ☆1,797 · Updated this week
- A simple and efficient Mamba implementation in pure PyTorch and MLX. ☆1,101 · Updated last month
- Code for BLT research paper ☆1,314 · Updated this week
- Minimalistic 4D-parallelism distributed training framework for educational purposes ☆644 · Updated this week
- A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs… ☆2,086 · Updated this week
- Implementation of Rotary Embeddings, from the RoFormer paper, in PyTorch ☆609 · Updated last month
- Building blocks for foundation models. ☆435 · Updated last year
- A collection of AWESOME things about mixture-of-experts ☆1,026 · Updated last month
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection ☆1,481 · Updated 2 months ago
- [ICML 2024 (Oral)] Official PyTorch implementation of DoRA: Weight-Decomposed Low-Rank Adaptation ☆681 · Updated 3 months ago
- Pipeline Parallelism for PyTorch ☆736 · Updated 4 months ago
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding ☆1,179 · Updated 3 months ago
- Ring attention implementation with flash attention ☆645 · Updated 3 weeks ago
- Flash Attention in ~100 lines of CUDA (forward pass only) ☆681 · Updated 2 weeks ago
- Official Implementation of EAGLE-1 (ICML'24) and EAGLE-2 (EMNLP'24) ☆912 · Updated 2 weeks ago
- Minimalistic large language model 3D-parallelism training ☆1,386 · Updated this week
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models ☆1,316 · Updated 6 months ago
- Official JAX implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States ☆384 · Updated 5 months ago