NVlabs / GatedDeltaNet

[ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule

☆382

Alternatives and similar repositories for GatedDeltaNet

Users who are interested in GatedDeltaNet are comparing it to the libraries listed below.

HanGuo97 / log-linear-attention
☆256 Updated 6 months ago
nil0x9 / flash-muon
Flash-Muon: An Efficient Implementation of Muon Optimizer
☆212 Updated 5 months ago
fla-org / flame
🔥 A minimal training framework for scaling FLA models
☆314 Updated 3 weeks ago
XunhaoLai / native-sparse-attention-triton
Efficient triton implementation of Native Sparse Attention.
☆251 Updated 6 months ago
zhixuan-lin / forgetting-transformer
[ICLR 2025 & COLM 2025] Official PyTorch implementation of the Forgetting Transformer and Adaptive Computation Pruning
☆134 Updated last month
apple / ml-sigmoid-attention
☆303 Updated 7 months ago
tensorgi / TPA
[NeurIPS 2025 Spotlight] TPA: Tensor ProducT ATTenTion Transformer (T6) (https://arxiv.org/abs/2501.06425)
☆427 Updated last month
sustcsonglin / linear-attention-and-beyond-slides
☆99 Updated 9 months ago
lucidrains / native-sparse-attention-pytorch
Implementation of the sparse attention pattern proposed by the Deepseek team in their "Native Sparse Attention" paper
☆785 Updated 3 months ago
OpenNLPLab / lightning-attention
Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models
☆335 Updated 9 months ago
lucidrains / nGPT-pytorch
Quick implementation of nGPT, learning entirely on the hypersphere, from NvidiaAI
☆294 Updated 6 months ago
zhijie-group / Discrete-Diffusion-Forcing
Discrete Diffusion Forcing (D2F): dLLMs Can Do Faster-Than-AR Inference
☆205 Updated 2 months ago
yuezhouhu / 2by4-pretrain
Efficient 2:4 sparse training algorithms and implementations
☆57 Updated 11 months ago
jxiw / MambaInLlama
[NeurIPS 2024] Official Repository of The Mamba in the Llama: Distilling and Accelerating Hybrid Models
☆232 Updated last month
NVlabs / Fast-dLLM
Official implementation of "Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding"
☆713 Updated last week
Dao-AILab / grouped-latent-attention
☆132 Updated 6 months ago
jzhang38 / LongMamba
Some preliminary explorations of Mamba's context scaling.
☆217 Updated last year
NVIDIA / ngpt
Normalized Transformer (nGPT)
☆194 Updated last year
mit-han-lab / flash-moba
☆201 Updated 2 weeks ago
alexzhang13 / flashattention2-custom-mask
Triton implementation of FlashAttention2 that adds Custom Masks.
☆151 Updated last year
TsinghuaC3I / Fourier-Position-Embedding
[ICML 2025] Fourier Position Embedding: Enhancing Attention’s Periodic Extension for Length Generalization
☆104 Updated 6 months ago
mit-han-lab / x-attention
[ICML 2025] XAttention: Block Sparse Attention with Antidiagonal Scoring
☆256 Updated 4 months ago
qiuzh20 / gated_attention
The official implementation for [NeurIPS2025 Oral] Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink…
☆273 Updated 2 months ago
NX-AI / flashrnn
FlashRNN - Fast RNN Kernels with I/O Awareness
☆169 Updated last month
fla-org / native-sparse-attention
🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention"
☆928 Updated 8 months ago
zyushun / Adam-mini
Code for Adam-mini: Use Fewer Learning Rates To Gain More https://arxiv.org/abs/2406.16793
☆445 Updated 6 months ago
MuLabPKU / TransMLA
TransMLA: Multi-Head Latent Attention Is All You Need (NeurIPS 2025 Spotlight)
☆413 Updated 2 months ago
xiayuqing0622 / flex_head_fa
Fast and memory-efficient exact attention
☆74 Updated 9 months ago
lucidrains / ring-attention-pytorch
Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch
☆548 Updated 6 months ago
yaof20 / Flash-RL
Implementation for FP8/INT8 Rollout for RL training without performance drop.
☆275 Updated 3 weeks ago