fla-org / flash-linear-attention
Efficient implementations of state-of-the-art linear attention models
★ 3,110 · Updated this week
Alternatives and similar repositories for flash-linear-attention
Users interested in flash-linear-attention are comparing it to the libraries listed below.
- Muon is an optimizer for hidden layers in neural networks · ★ 1,672 · Updated 2 months ago
- Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" · ★ 847 · Updated 5 months ago
- Puzzles for learning Triton · ★ 1,978 · Updated 9 months ago
- Minimalistic 4D-parallelism distributed training framework for education purposes · ★ 1,715 · Updated 2 weeks ago
- FlashInfer: Kernel Library for LLM Serving · ★ 3,723 · Updated this week
- Implementation of the sparse attention pattern proposed by the DeepSeek team in their "Native Sparse Attention" paper · ★ 740 · Updated 3 weeks ago
- Tile primitives for speedy kernels · ★ 2,672 · Updated last week
- Helpful tools and examples for working with flex-attention · ★ 961 · Updated 3 weeks ago
- PyTorch native quantization and sparsity for training and inference · ★ 2,341 · Updated this week
- Mirage Persistent Kernel: Compiling LLMs into a MegaKernel · ★ 1,773 · Updated this week
- Official PyTorch implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States · ★ 1,248 · Updated last year
- A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Bla… · ★ 2,720 · Updated this week
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 · ★ 1,635 · Updated this week
- Muon is Scalable for LLM Training · ★ 1,302 · Updated last month
- A domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels · ★ 1,608 · Updated this week
- Minimalistic large language model 3D-parallelism training · ★ 2,191 · Updated last week
- A PyTorch native platform for training generative AI models · ★ 4,372 · Updated this week
- [MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration · ★ 3,246 · Updated last month
- Tutel MoE: an optimized Mixture-of-Experts library supporting GptOss/DeepSeek/Kimi-K2/Qwen3 with FP8/NVFP4/MXFP4 · ★ 907 · Updated 2 weeks ago
- [TMLR 2024] Efficient Large Language Models: A Survey · ★ 1,212 · Updated 2 months ago
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models · ★ 1,492 · Updated last year
- Flash Attention in ~100 lines of CUDA (forward pass only) · ★ 918 · Updated 8 months ago
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection · ★ 1,596 · Updated 10 months ago
- Must-read papers and blogs on Speculative Decoding · ★ 917 · Updated last week
- Ring attention implementation with flash attention · ★ 864 · Updated last month
- Implementing DeepSeek R1's GRPO algorithm from scratch · ★ 1,561 · Updated 4 months ago
- A bibliography and survey of the papers surrounding o1 · ★ 1,208 · Updated 9 months ago
- slime is an LLM post-training framework for RL scaling · ★ 1,652 · Updated this week
- A collection of AWESOME things about mixture-of-experts · ★ 1,198 · Updated 9 months ago
- A family of open-sourced Mixture-of-Experts (MoE) Large Language Models · ★ 1,584 · Updated last year