The evaluation framework for training-free sparse attention in LLMs
☆122Jan 27, 2026Updated last month
Alternatives and similar repositories for sparse-frontier
Users that are interested in sparse-frontier are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.
Sorting:
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference.☆91Jul 17, 2025Updated 8 months ago
- [ICML 2025] XAttention: Block Sparse Attention with Antidiagonal Scoring☆273Jul 6, 2025Updated 8 months ago
- Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference☆164Oct 13, 2025Updated 5 months ago
- ☆18Mar 11, 2025Updated last year
- The Official Implementation of Ada-KV [NeurIPS 2025]☆128Nov 26, 2025Updated 4 months ago
- 1-Click AI Models by DigitalOcean Gradient • AdDeploy popular AI models on DigitalOcean Gradient GPU virtual machines with just a single click and start building anything your business needs.
- KV cache compression for high-throughput LLM inference☆153Feb 5, 2025Updated last year
- Research work aimed at addressing the problem of modeling infinite-length context☆48Dec 18, 2025Updated 3 months ago
- ☆136May 29, 2025Updated 9 months ago
- Official Implementation for [ICLR26] DefensiveKV: Taming the Fragility of KV Cache Eviction in LLM Inference☆31Mar 19, 2026Updated last week
- This is the official repo for the paper "Accelerating Parallel Sampling of Diffusion Models" Tang et al. ICML 2024 https://openreview.net…☆16Jul 19, 2024Updated last year
- An efficient implementation of the NSA (Native Sparse Attention) kernel☆132Jun 24, 2025Updated 9 months ago
- The code for creating the iGSM datasets in papers "Physics of Language Models Part 2.1, Grade-School Math and the Hidden Reasoning Proces…☆85Jan 12, 2025Updated last year
- ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression (DAC'25)☆27Feb 26, 2026Updated last month
- Parallel Scaling Law for Language Model — Beyond Parameter and Inference Time Scaling☆476May 17, 2025Updated 10 months ago
- Managed Database hosting by DigitalOcean • AdPostgreSQL, MySQL, MongoDB, Kafka, Valkey, and OpenSearch available. Automatically scale up storage and focus on building your apps.
- [ICLR 2025 & COLM 2025] Official PyTorch implementation of the Forgetting Transformer and Adaptive Computation Pruning☆144Feb 25, 2026Updated last month
- [VLDB 26, NeurIPS 25] Scalable long-context LLM decoding that leverages sparsity—by treating the KV cache as a vector storage system.☆133Feb 22, 2026Updated last month
- Fork of Flame repo for training of some new stuff in development☆19Mar 17, 2026Updated last week
- [ICML 2025 Spotlight] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference☆286May 1, 2025Updated 10 months ago
- RWKV-X is a Linear Complexity Hybrid Language Model based on the RWKV architecture, integrating Sparse Attention to improve the model's l…☆56Mar 19, 2026Updated last week
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads☆532Feb 10, 2025Updated last year
- Customized Inference Engine for Multiverse Models☆25Jun 27, 2025Updated 8 months ago
- ☆20May 30, 2024Updated last year
- [ACL 2025] Squeezed Attention: Accelerating Long Prompt LLM Inference☆58Nov 20, 2024Updated last year
- DigitalOcean Gradient AI Platform • AdBuild production-ready AI agents using customizable tools or access multiple LLMs through a single endpoint. Create custom knowledge bases or connect external data.
- Code and data for paper "(How) do Language Models Track State?"☆22Mar 31, 2025Updated 11 months ago
- ☆39Oct 11, 2025Updated 5 months ago
- PyTorch implementation of the Flash Spectral Transform Unit.☆22Sep 19, 2024Updated last year
- Code for ICLR 2025 Paper "What is Wrong with Perplexity for Long-context Language Modeling?"☆110Oct 11, 2025Updated 5 months ago
- ☆234Nov 19, 2025Updated 4 months ago
- Code repo for "CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs".☆16Sep 15, 2024Updated last year
- Official Code Repository for the paper "Key-value memory in the brain"☆31Feb 25, 2025Updated last year
- Landing repository for the paper "Softpick: No Attention Sink, No Massive Activations with Rectified Softmax"☆88Sep 12, 2025Updated 6 months ago
- Showing how to use CUDA on google colab☆13Feb 24, 2025Updated last year
- GPU virtual machines on DigitalOcean Gradient AI • AdGet to production fast with high-performance AMD and NVIDIA GPUs you can spin up in seconds. The definition of operational simplicity.
- ☆14Oct 3, 2024Updated last year
- [ICLR 2025] Official Pytorch Implementation of "Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN" by Pengxia…☆29Jul 24, 2025Updated 8 months ago
- Unofficial implementation of paper : Exploring the Space of Key-Value-Query Models with Intention☆12May 24, 2023Updated 2 years ago
- [NeurIPS'25 Oral] Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3)☆209Feb 11, 2026Updated last month
- Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence☆59Nov 11, 2025Updated 4 months ago
- CLaMR: Contextualized Late-Interaction for Multimodal Content Retrieval☆24Jun 28, 2025Updated 8 months ago
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, spars…☆374Dec 12, 2024Updated last year