☆272 · Updated Jun 6, 2025
Alternatives and similar repositories for log-linear-attention
Users interested in log-linear-attention are comparing it with the libraries listed below.
- An efficient implementation of the NSA (Native Sparse Attention) kernel ☆129 · Updated Jun 24, 2025
- ☆226 · Updated Nov 19, 2025
- ☆136 · Updated May 29, 2025
- Code for the paper "Function-Space Learning Rates" ☆25 · Updated Jun 3, 2025
- 🚀 Efficient implementations of state-of-the-art linear attention models ☆4,474 · Updated this week
- Reference implementation of "Softmax Attention with Constant Cost per Token" (Heinsen, 2024) ☆24 · Updated Jun 6, 2024
- ☆22 · Updated May 5, 2025
- Fork of the Flame repo for training some new work in development ☆19 · Updated Feb 27, 2026
- ☆134 · Updated Aug 18, 2025
- Official repository for the paper "Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention for Test-Time Regressi…" ☆23 · Updated Oct 1, 2025
- Code for the NeurIPS 2024 Spotlight "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations" ☆92 · Updated Oct 30, 2024
- FlexAttention with FlashAttention-3 support ☆27 · Updated Oct 5, 2024
- ☆19 · Updated Dec 4, 2025
- Code for "What really matters in matrix-whitening optimizers?" ☆22 · Updated Oct 31, 2025
- Experiments notebook for "Understanding the Skill Gap in Recurrent Language Models: The Role of the Gather-and-Aggregate Mechanism" ☆14 · Updated Apr 30, 2025
- Distributed attention for linear scalability with ultra-long contexts and heterogeneous data training ☆659 · Updated this week
- 🐳 Efficient Triton implementations of "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" ☆969 · Updated Feb 5, 2026
- Distributed compiler based on Triton for parallel systems ☆1,371 · Updated Feb 13, 2026
- 🤖 FFPA: extends FlashAttention-2 with Split-D and ~O(1) SRAM complexity for large head dimensions; 1.8x–3x speedup over SDPA EA 🎉 ☆255 · Updated Feb 13, 2026
- Combining SOAP and Muon ☆19 · Updated Feb 11, 2025
- Flash-Muon: an efficient implementation of the Muon optimizer ☆239 · Updated Jun 15, 2025
- ☆44 · Updated Nov 1, 2025
- Code for the ICLR 2025 paper "What is Wrong with Perplexity for Long-context Language Modeling?" ☆110 · Updated Oct 11, 2025
- ☆53 · Updated May 20, 2024
- Official implementation of "Hydra: Bidirectional State Space Models Through Generalized Matrix Mixers" ☆170 · Updated Jan 30, 2025
- Helpful tools and examples for working with flex-attention ☆1,140 · Updated Feb 8, 2026
- ☆24 · Updated Sep 25, 2024
- TileGraph: an experimental DNN compiler using static code generation and kernel-fusion techniques ☆12 · Updated Sep 18, 2024
- ☆118 · Updated May 19, 2025
- Resa: Transparent Reasoning Models via SAEs ☆47 · Updated Sep 23, 2025
- Stick-breaking attention ☆62 · Updated Jul 1, 2025
- ☆63 · Updated Oct 3, 2024
- Landing repository for the paper "Softpick: No Attention Sink, No Massive Activations with Rectified Softmax" ☆88 · Updated Sep 12, 2025
- An attempt to speed up the Newton–Schulz iteration, starting from the Dion implementation ☆32 · Updated Dec 5, 2025
- [NeurIPS 2025] Official implementation of "Scaling Diffusion Transformers Efficiently via μP" ☆95 · Updated Nov 2, 2025
- Simple and scalable pretraining for neural-architecture research ☆308 · Updated Dec 6, 2025
- Train with kittens! ☆63 · Updated Oct 25, 2024
- A bunch of kernels that might make stuff slower 😉 ☆75 · Updated Feb 18, 2026
- ☆262 · Updated Jul 11, 2024
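Many of the repositories above implement variants of linear attention, which replaces the softmax kernel with a feature map φ so that causal attention can be computed with a running state in O(N) time instead of materializing the O(N²) attention matrix. As a minimal sketch of the idea only (the `elu+1` feature map and the shapes here are illustrative assumptions, not the API of any listed library):

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: an illustrative positive feature map, not any repo's exact choice
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention(Q, K, V):
    """Causal linear attention via a running state.

    Q, K: (N, d_k); V: (N, d_v). Cost is O(N * d_k * d_v),
    versus O(N^2) for materializing the full attention matrix.
    """
    Qf, Kf = feature_map(Q), feature_map(K)
    d_k, d_v = Q.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))   # running sum of phi(k_i) v_i^T
    z = np.zeros(d_k)          # running sum of phi(k_i), for normalization
    out = np.empty((len(Q), d_v))
    for t in range(len(Q)):
        S += np.outer(Kf[t], V[t])
        z += Kf[t]
        out[t] = (Qf[t] @ S) / (Qf[t] @ z + 1e-6)
    return out
```

The running state makes decoding O(1) per token; the quadratic form `tril(φ(Q)φ(K)ᵀ) V`, row-normalized, gives the same output and serves as a correctness check.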