Zyphra / tree_attention
View external linksLinks

Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters

☆132

Alternatives and similar repositories for tree_attention

Users that are interested in tree_attention are comparing it to the libraries listed below

Sorting:

DeepAuto-AI / sglang
View on GitHub
This is a fork of SGLang for hip-attention integration. Please refer to hip-attention for detail.
☆18Dec 23, 2025Updated last month
OpenNLPLab / LASP
View on GitHub
Linear Attention Sequence Parallelism (LASP)
☆88Jun 4, 2024Updated last year
FasterDecoding / SnapKV
View on GitHub
☆303Jul 10, 2025Updated 7 months ago
GindaChen / FlexFlashAttention3
View on GitHub
FlexAttention w/ FlashAttention3 Support
☆27Oct 5, 2024Updated last year
zhuzilin / ring-flash-attention
View on GitHub
Ring attention implementation with flash attention
☆980Sep 10, 2025Updated 5 months ago
Leooyii / LCEG
View on GitHub
Long Context Extension and Generalization in LLMs
☆62Sep 21, 2024Updated last year
NVIDIA / kvpress
View on GitHub
LLM KV cache compression made easy
☆876Jan 28, 2026Updated 2 weeks ago
TRI-ML / linear_open_lm
View on GitHub
A repository for research on medium sized language models.
☆77May 23, 2024Updated last year
arcee-ai / DAM
View on GitHub
☆56Nov 6, 2024Updated last year
Wenda302 / IsarStep
View on GitHub
Code and dataset for the paper "IsarStep: a Benchmark for High-level Mathematical Reasoning"
☆12Mar 15, 2021Updated 4 years ago
catid / lllm
View on GitHub
Latent Large Language Models
☆19Aug 24, 2024Updated last year
whyNLP / LCKV
View on GitHub
Layer-Condensed KV cache w/ 10 times larger batch size, fewer params and less computation. Dramatic speed up with better task performance…
☆157Apr 7, 2025Updated 10 months ago
HazyResearch / lolcats
View on GitHub
Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models"
☆251Jan 31, 2025Updated last year
haoliuhl / ringattention
View on GitHub
Large Context Attention
☆766Oct 13, 2025Updated 4 months ago
allenai / signal-and-noise
View on GitHub
Measuring the Signal to Noise Ratio in Language Model Evaluation
☆28Aug 19, 2025Updated 5 months ago
microsoft / MInference
View on GitHub
[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up Long-context LLMs' inference, approximate and dynamic sparse calculate the attention…
☆1,183Sep 30, 2025Updated 4 months ago
PrimeIntellect-ai / OpenDiloco
View on GitHub
OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training
☆562Jan 13, 2025Updated last year
sheryc / resonance_rope
View on GitHub
[ACL 24 Findings] Implementation of Resonance RoPE and the PosGen synthetic dataset.
☆24Mar 5, 2024Updated last year
Baichenjia / COPO
View on GitHub
Online Preference Alignment for Language Models via Count-based Exploration
☆17Jan 14, 2025Updated last year
fla-org / native-sparse-attention
View on GitHub
🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention"
☆964Feb 5, 2026Updated last week
xlite-dev / ffpa-attn
View on GitHub
🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA.
☆250Feb 5, 2026Updated last week
BlinkDL / fast.c
View on GitHub
Prepare for DeekSeek R1 inference: Benchmark CPU, DRAM, SSD, iGPU, GPU, ... with efficient code.
☆74Feb 2, 2025Updated last year
Zyphra / zcookbook
View on GitHub
Training hybrid models for dummies.
☆29Nov 1, 2025Updated 3 months ago
apple / ml-dataset-decomposition
View on GitHub
Official repo of dataset-decomposition paper [NeurIPS 2024]
☆21Jan 8, 2025Updated last year
NVIDIA / Star-Attention
View on GitHub
Efficient LLM Inference over Long Sequences
☆394Jun 25, 2025Updated 7 months ago
gpu-mode / ring-attention
View on GitHub
ring-attention experiments
☆165Oct 17, 2024Updated last year
OpenNLPLab / lightning-attention
View on GitHub
Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models
☆341Feb 23, 2025Updated 11 months ago
itsnamgyu / block-transformer
View on GitHub
Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024)
☆163Apr 13, 2025Updated 10 months ago
microsoft / SparseMixer
View on GitHub
Sparse Backpropagation for Mixture-of-Expert Training
☆29Jul 2, 2024Updated last year
mit-han-lab / duo-attention
View on GitHub
[ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
☆524Feb 10, 2025Updated last year
sail-sg / SimLayerKV
View on GitHub
The official implementation of paper: SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction.
☆52Oct 18, 2024Updated last year
gisilvs / AEF
View on GitHub
☆33Mar 1, 2023Updated 2 years ago
Zyphra / BlackMamba
View on GitHub
Code repository for Black Mamba
☆261Feb 8, 2024Updated 2 years ago
cloneofsimo / min-max-gpt
View on GitHub
Minimal (400 LOC) implementation Maximum (multi-node, FSDP) GPT training
☆132Apr 17, 2024Updated last year
HomebrewML / Olmax
View on GitHub
HomebrewNLP in JAX flavour for maintable TPU-Training
☆51Jan 20, 2024Updated 2 years ago
philippe-eecs / vitok
View on GitHub
☆34May 14, 2025Updated 9 months ago
nshepperd / flash_attn_jax
View on GitHub
JAX bindings for Flash Attention v2
☆103Feb 5, 2026Updated last week
berlino / seq_icl
View on GitHub
☆53May 20, 2024Updated last year
LCM-Lab / LOGO
View on GitHub
Code for paper: Long cOntext aliGnment via efficient preference Optimization
☆24Oct 10, 2025Updated 4 months ago

Zyphra / tree_attentionView external linksLinks

Alternatives and similar repositories for tree_attention

Zyphra / tree_attention
View external linksLinks