🚀 Efficient implementations of state-of-the-art linear attention models
☆ 4,630 · Mar 18, 2026 · Updated this week
Alternatives and similar repositories for flash-linear-attention
Users interested in flash-linear-attention are comparing it to the libraries listed below.
- 🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" · ☆ 977 · Feb 5, 2026 · Updated last month
- FlashInfer: Kernel Library for LLM Serving · ☆ 5,194 · Updated this week
- Tile primitives for speedy kernels · ☆ 3,244 · Updated this week
- 🔥 A minimal training framework for scaling FLA models · ☆ 358 · Nov 15, 2025 · Updated 4 months ago
- Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels · ☆ 5,403 · Updated this week
- A PyTorch native platform for training generative AI models · ☆ 5,162 · Updated this week
- Fast and memory-efficient exact attention · ☆ 22,832 · Updated this week
- Efficient Triton Kernels for LLM Training · ☆ 6,216 · Updated this week
- Ring attention implementation with flash attention · ☆ 996 · Sep 10, 2025 · Updated 6 months ago
- Distributed Compiler based on Triton for Parallel Systems · ☆ 1,394 · Mar 11, 2026 · Updated last week
- Helpful tools and examples for working with flex-attention · ☆ 1,161 · Feb 8, 2026 · Updated last month
- Development repository for the Triton language and compiler · ☆ 18,708 · Updated this week
- Puzzles for learning Triton · ☆ 2,336 · Updated this week
- SGLang is a high-performance serving framework for large language models and multimodal models · ☆ 24,829 · Updated this week
- MoBA: Mixture of Block Attention for Long-Context LLMs · ☆ 2,083 · Apr 3, 2025 · Updated 11 months ago
- verl: Volcano Engine Reinforcement Learning for LLMs · ☆ 20,097 · Updated this week
- A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on H… · ☆ 3,231 · Updated this week
- PyTorch native quantization and sparsity for training and inference · ☆ 2,739 · Updated this week
- [ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule · ☆ 516 · Mar 13, 2026 · Updated last week
- Mamba SSM architecture · ☆ 17,524 · Updated this week
- Understand and test language model architectures on synthetic tasks · ☆ 263 · Updated this week
- Ongoing research training transformer models at scale · ☆ 15,744 · Updated this week
- Mirage Persistent Kernel: Compiling LLMs into a MegaKernel · ☆ 2,159 · Updated this week
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models · ☆ 341 · Feb 23, 2025 · Updated last year
- Muon is Scalable for LLM Training · ☆ 1,446 · Aug 3, 2025 · Updated 7 months ago
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… · ☆ 821 · Mar 6, 2025 · Updated last year
- LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability… · ☆ 3,958 · Updated this week
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] Speeds up long-context LLM inference via approximate, dynamic sparse attention computation… · ☆ 1,198 · Mar 9, 2026 · Updated 2 weeks ago
- An Easy-to-use, Scalable and High-performance Agentic RL Framework based on Ray (PPO & DAPO & REINFORCE++ & TIS & vLLM & Ray & Async RL) · ☆ 9,191 · Mar 16, 2026 · Updated last week
- A fast communication-overlapping library for tensor/expert parallelism on GPUs · ☆ 1,273 · Aug 28, 2025 · Updated 6 months ago
- A Quirky Assortment of CuTe Kernels · ☆ 863 · Updated this week
- Accelerated First Order Parallel Associative Scan · ☆ 196 · Jan 7, 2026 · Updated 2 months ago
- CUDA Templates and Python DSLs for High-Performance Linear Algebra · ☆ 9,484 · Updated this week
- Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff" · ☆ 250 · Jun 6, 2025 · Updated 9 months ago
- DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling · ☆ 6,268 · Feb 27, 2026 · Updated 3 weeks ago
- A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Data Training · ☆ 676 · Mar 16, 2026 · Updated last week
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton · ☆ 598 · Aug 12, 2025 · Updated 7 months ago
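As a point of reference for what these libraries optimize, below is a minimal, unoptimized PyTorch sketch of the causal linear-attention recurrence (state update S_t = S_{t-1} + k_t v_tᵀ, readout o_t = S_tᵀ q_t) that flash-linear-attention and similar projects implement as fused Triton or CUDA kernels. The function name, tensor shapes, and the omitted feature map on q/k are illustrative assumptions, not any particular library's API.

```python
import torch

def linear_attention_reference(q, k, v):
    """Naive O(T) recurrent form of causal linear attention.

    q, k: [batch, heads, seq_len, d_k]; v: [batch, heads, seq_len, d_v].
    A real implementation would also apply a feature map to q and k
    and normalize the output; both are omitted here for brevity.
    """
    B, H, T, Dk = q.shape
    Dv = v.shape[-1]
    S = q.new_zeros(B, H, Dk, Dv)                  # running key-value state
    out = torch.empty(B, H, T, Dv, dtype=q.dtype, device=q.device)
    for t in range(T):
        # rank-1 state update: S_t = S_{t-1} + k_t v_t^T
        S = S + k[:, :, t, :, None] * v[:, :, t, None, :]
        # readout: o_t = S_t^T q_t (contract over the key dimension)
        out[:, :, t] = (q[:, :, t, :, None] * S).sum(dim=-2)
    return out

# usage sketch (shapes are arbitrary)
q = torch.randn(2, 4, 128, 64)
k = torch.randn(2, 4, 128, 64)
v = torch.randn(2, 4, 128, 64)
o = linear_attention_reference(q, k, v)            # -> [2, 4, 128, 64]
```

The Python loop makes the O(T) time and O(d_k · d_v) state explicit; the fused kernels in the libraries above compute the same recurrence in parallel chunks on the GPU instead of step by step.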