zhixuan-lin / forgetting-transformer
[ICLR 2025 & COLM 2025] Official PyTorch implementation of the Forgetting Transformer and Adaptive Computation Pruning
☆125Updated last month
Alternatives and similar repositories for forgetting-transformer
Users who are interested in forgetting-transformer are comparing it to the libraries listed below.
- ☆84Updated 6 months ago
- ☆35Updated 6 months ago
- The official implementation of "DAPE: Data-Adaptive Positional Encoding for Length Extrapolation"☆39Updated 11 months ago
- Stick-breaking attention☆60Updated 2 months ago
- [ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule☆228Updated 5 months ago
- Here we will test various linear attention designs.☆62Updated last year
- ☆242Updated 3 months ago
- Triton implementation of bi-directional (non-causal) linear attention☆54Updated 7 months ago
- ☆106Updated last year
- [ICML'24 Oral] The official code of "DiJiang: Efficient Large Language Models through Compact Kernelization", a novel DCT-based linear at…☆102Updated last year
- Official PyTorch implementation and models for paper "Diffusion Beats Autoregressive in Data-Constrained Settings". We find diffusion mod…