bdusell / stack-attention
Code for the paper "Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns"
☆ 17 · Updated last year
Alternatives and similar repositories for stack-attention
Users interested in stack-attention are comparing it to the repositories listed below.
- Fine-Tuning Pre-trained Transformers into Decaying Fast Weights ☆ 19 · Updated 2 years ago
- ☆ 20 · Updated 11 months ago
- ☆ 32 · Updated last year
- Efficient Scaling laws and collaborative pretraining. ☆ 16 · Updated 3 months ago
- Official code for the paper "Attention as a Hypernetwork" ☆ 33 · Updated 10 months ago
- [ICLR'25] "Understanding Bottlenecks of State Space Models through the Lens of Recency and Over-smoothing" by Peihao Wang, Ruisi Cai, Yue… ☆ 11 · Updated last month
- Curse-of-memory phenomenon of RNNs in sequence modelling