NVlabs / GatedDeltaNet
[ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule
☆167 · Updated 2 months ago
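For context, the gated delta rule at the heart of this repo updates a matrix-valued fast-weight state with Mamba2's scalar decay gate combined with DeltaNet's rank-1 delta-rule correction: S_t = α_t (S_{t−1} − β_t S_{t−1} k_t k_tᵀ) + β_t v_t k_tᵀ, with output o_t = S_t q_t. Below is a minimal, unoptimized PyTorch sketch of that recurrence; variable names are illustrative, and the actual repo uses fused chunkwise Triton/FLA kernels rather than this sequential loop.

```python
import torch

def gated_delta_rule(q, k, v, alpha, beta):
    """Naive sequential gated delta rule recurrence (illustrative sketch only).

    q, k, v: (T, d) query/key/value sequences (keys assumed L2-normalized)
    alpha:   (T,) scalar forget gate in (0, 1), as in Mamba2
    beta:    (T,) scalar write strength in (0, 1), as in DeltaNet
    Returns per-step outputs o_t = S_t q_t, shape (T, d).
    """
    T, d = q.shape
    S = torch.zeros(d, d)  # matrix-valued fast-weight state
    outs = []
    for t in range(T):
        kt, vt = k[t], v[t]
        # decay the state, erase along k_t, then write the new association v_t k_t^T
        S = alpha[t] * (S - beta[t] * (S @ torch.outer(kt, kt))) + beta[t] * torch.outer(vt, kt)
        outs.append(S @ q[t])
    return torch.stack(outs)

# toy usage
T, d = 8, 16
q, v = torch.randn(T, d), torch.randn(T, d)
k = torch.nn.functional.normalize(torch.randn(T, d), dim=-1)
alpha, beta = torch.sigmoid(torch.randn(T)), torch.sigmoid(torch.randn(T))
print(gated_delta_rule(q, k, v, alpha, beta).shape)  # torch.Size([8, 16])
```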
Alternatives and similar repositories for GatedDeltaNet
Users interested in GatedDeltaNet are comparing it to the repositories listed below.
- [ICLR 2025] Official PyTorch implementation of "Forgetting Transformer: Softmax Attention with a Forget Gate" ☆104 · Updated 2 weeks ago
- Some preliminary explorations of Mamba's context scaling. ☆212 · Updated last year
- Normalized Transformer (nGPT) ☆181 · Updated 6 months ago
- 🔥 A minimal training framework for scaling FLA models ☆146 · Updated 3 weeks ago
- [ICML 2025] Fourier Position Embedding: Enhancing Attention’s Periodic Extension for Length Generalization ☆69 · Updated 4 months ago
- Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff" ☆234 · Updated 3 months ago
- Stick-breaking attention ☆56 · Updated 2 months ago
- [NeurIPS 2024] Official Repository of The Mamba in the Llama: Distilling and Accelerating Hybrid Models ☆221 · Updated last month
- PyTorch implementation of the PEER block from the paper "Mixture of A Million Experts" by Xu Owen He at DeepMind ☆124 · Updated 9 months ago
- Accelerated First Order Parallel Associative Scan ☆181 · Updated 9 months ago
- XAttention: Block Sparse Attention with Antidiagonal Scoring ☆158 · Updated 3 weeks ago
- Griffin MQA + Hawk Linear RNN Hybrid ☆86 · Updated last year
- Quick implementation of nGPT, learning entirely on the hypersphere, from NvidiaAI ☆282 · Updated 2 months ago
- Official implementation of "Hydra: Bidirectional State Space Models Through Generalized Matrix Mixers" ☆135 · Updated 4 months ago
- Attempt to make multiple residual streams from Bytedance's Hyper-Connections paper accessible to the public ☆83 · Updated 3 months ago
- Official implementation of Phi-Mamba. A MOHAWK-distilled model (Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Mode… ☆108 · Updated 8 months ago
- Efficient Triton implementation of Native Sparse Attention. ☆155 · Updated last week
- [ICLR 2025 Spotlight] Official Implementation for ToST (Token Statistics Transformer) ☆88 · Updated 3 months ago
- When it comes to optimizers, it's always better to be safe than sorry ☆233 · Updated 2 months ago
- Implementation of the proposed MaskBit from Bytedance AI ☆80 · Updated 6 months ago
- Tiled Flash Linear Attention library for fast and efficient mLSTM Kernels. ☆56 · Updated 2 weeks ago
- Official implementation of the paper: "ZClip: Adaptive Spike Mitigation for LLM Pre-Training". ☆124 · Updated last week
- FlashRNN - Fast RNN Kernels with I/O Awareness ☆90 · Updated 2 months ago
- Fast and memory-efficient exact attention ☆68 · Updated 3 months ago
- Understand and test language model architectures on synthetic tasks. ☆197 · Updated 2 months ago
- Triton implementation of FlashAttention2 that adds Custom Masks. ☆117 · Updated 9 months ago
- Triton implementation of bi-directional (non-causal) linear attention ☆48 · Updated 4 months ago
- Implementation of TiTok, proposed by Bytedance in "An Image is Worth 32 Tokens for Reconstruction and Generation" ☆173 · Updated 11 months ago