PiotrNawrot / sparse-frontierLinks

The evaluation framework for training-free sparse attention in LLMs

☆101

Alternatives and similar repositories for sparse-frontier

Users that are interested in sparse-frontier are comparing it to the libraries listed below

Sorting:

Dao-AILab / grouped-latent-attention
☆130Updated 4 months ago
PiotrNawrot / nano-sparse-attention
The simplest implementation of recent Sparse Attention patterns for efficient LLM inference.
☆92Updated 3 months ago
tilde-research / nsa-impl
An efficient implementation of the NSA (Native Sparse Attention) kernel
☆119Updated 4 months ago
nil0x9 / flash-muon
Flash-Muon: An Efficient Implementation of Muon Optimizer
☆195Updated 4 months ago
FasterDecoding / TEAL
☆145Updated 8 months ago
xiayuqing0622 / flex_head_fa
Fast and memory-efficient exact attention
☆71Updated 7 months ago
Zyphra / tree_attention
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters
☆130Updated 10 months ago
siyan-zhao / prepacking
The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS …
☆60Updated last year
Infini-AI-Lab / Kinetics
Kinetics: Rethinking Test-Time Scaling Laws
☆81Updated 3 months ago
mengxiayu / LLMSuperWeight
Code for studying the super weight in LLM
☆120Updated 10 months ago
VITA-Group / WeLore
From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients. Ajay Jaiswal, Lu Yin, Zhenyu Zhang, Shiwei Liu,…
☆49Updated 6 months ago
Infini-AI-Lab / gsm_infinite
☆55Updated 4 months ago
sustcsonglin / linear-attention-and-beyond-slides
☆93Updated 7 months ago
HanGuo97 / log-linear-attention
☆251Updated 4 months ago
IST-DASLab / QuEST
Work in progress.
☆74Updated 3 months ago
opengear-project / GEAR
GEAR: An Efficient KV Cache Compression Recipefor Near-Lossless Generative Inference of LLM
☆169Updated last year
AnswerDotAI / cold-compress
Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of…
☆147Updated last year
softmax1 / Flash-Attention-Softmax-N
CUDA and Triton implementations of Flash Attention with SoftmaxN.
☆73Updated last year
samsja / muon_fsdp_2
Muon fsdp 2
☆44Updated 2 months ago
li-plus / flash-preference
Accelerate LLM preference tuning via prefix sharing with a single line of code
☆46Updated 3 months ago
Edward-Sun / gpt-accelera
Simple and efficient pytorch-native transformer training and inference (batched)
☆78Updated last year
wdlctc / mini-s
☆52Updated 11 months ago
selfsupervised-ai / Natural-GaLore
An extention to the GaLore paper, to perform Natural Gradient Descent in low rank subspace
☆18Updated last year
insuhan / hyper-attn
☆83Updated last year
IST-DASLab / SparseFinetuning
Repository for Sparse Finetuning of LLMs via modified version of the MosaicML llmfoundry
☆42Updated last year
RobertCsordas / moeut
☆86Updated last year
amazon-science / mxfp4-llm
Official implementation for Training LLMs with MXFP4
☆97Updated 6 months ago
SalesforceAIResearch / GemFilter
☆85Updated 9 months ago
FasterDecoding / BitDelta
☆202Updated 10 months ago
ByteDance-Seed / FlexPrefill
Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
☆147Updated last week