Noahs-ARK / RFA (☆32, updated 3 years ago)
Related projects
Alternatives and complementary repositories for RFA
- Code to reproduce the results for Compositional Attention (☆60, updated 2 years ago)
- [EVA ICLR'23; LARA ICML'22] Efficient attention mechanisms via control variates, random features, and importance sampling (☆79, updated last year)
- Mixture of Attention Heads (☆39, updated 2 years ago)
- Parameter Efficient Transfer Learning with Diff Pruning (☆72, updated 3 years ago)
- Learning to Encode Position for Transformer with Continuous Dynamical Model (☆59, updated 4 years ago)
- [ICLR 2023] "Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers" by Tianlong Chen*, Zhenyu Zhang*, Ajay Jaiswal… (☆44, updated last year)
- A Kernel-Based View of Language Model Fine-Tuning (https://arxiv.org/abs/2210.05643) (☆69, updated last year)
- This package implements THOR: Transformer with Stochastic Experts. (☆61, updated 3 years ago)
- Code for the paper "Query-Key Normalization for Transformers"☆35Updated 3 years ago
- Code for the paper PermuteFormer (☆42, updated 3 years ago)
- (ACL-IJCNLP 2021) Convolutions and Self-Attention: Re-interpreting Relative Positions in Pre-trained Language Models (☆21, updated 2 years ago)
- The official repository for our paper "The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization" (☆32, updated 3 years ago)
- [ICML 2022] Latent Diffusion Energy-Based Model for Interpretable Text Modeling (☆63, updated 2 years ago)
- The official repository for our paper "The Dual Form of Neural Networks Revisited: Connecting Test Time Predictions to Training Patterns …" (☆16, updated last year)
- Implementation of QKVAE (☆11, updated last year)
- Code for the ACL-2022 paper "StableMoE: Stable Routing Strategy for Mixture of Experts" (☆42, updated 2 years ago)
- Official repository for the paper "Going Beyond Linear Transformers with Recurrent Fast Weight Programmers" (NeurIPS 2021) (☆47, updated last year)
- The accompanying code for "Memory-efficient Transformers via Top-k Attention" (Ankit Gupta, Guy Dar, Shaya Goodman, David Ciprut, Jonatha…) (☆60, updated 3 years ago); a sketch of the top-k masking appears after this list
- Stabilizing Gradients for Deep Neural Networks via Efficient SVD Parameterization (☆16, updated 6 years ago)
- Curse-of-memory phenomenon of RNNs in sequence modelling (☆19, updated this week)
- Code for the paper "Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning" (☆22, updated 3 weeks ago)
- Unofficial PyTorch implementation of "Step-unrolled Denoising Autoencoders for Text Generation" (☆23, updated 2 years ago)
- Gradient-based Hyperparameter Optimization Over Long Horizons (☆12, updated 3 years ago)
- Skyformer: Remodel Self-Attention with Gaussian Kernel and Nyström Method (NeurIPS 2021) (☆59, updated 2 years ago)
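For the "Query-Key Normalization for Transformers" entry above, here is a minimal PyTorch sketch of the core idea: queries and keys are L2-normalized so their dot products become cosine similarities, and a learnable scalar replaces the usual 1/sqrt(d) temperature. The function name, tensor shapes, and the scale's initial value are illustrative assumptions, not the repo's API.

```python
import torch
import torch.nn.functional as F

def qk_norm_attention(q, k, v, scale):
    # L2-normalize queries and keys along the feature dimension,
    # so each dot product is a cosine similarity in [-1, 1].
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    # A learnable scalar temperature replaces the usual 1/sqrt(d).
    attn = torch.softmax(scale * (q @ k.transpose(-2, -1)), dim=-1)
    return attn @ v

# Usage: batch of 2 sequences, length 16, head dim 64 (shapes are assumptions).
q = torch.randn(2, 16, 64)
k = torch.randn(2, 16, 64)
v = torch.randn(2, 16, 64)
scale = torch.nn.Parameter(torch.tensor(10.0))  # initial value is an assumption
out = qk_norm_attention(q, k, v, scale)
print(out.shape)  # torch.Size([2, 16, 64])
```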
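Similarly, for the "Memory-efficient Transformers via Top-k Attention" entry, a sketch of the top-k masking step: only the k largest scores per query survive the softmax. This toy version materializes the full score matrix, so it demonstrates the masking alone, not the chunked, memory-efficient computation the paper contributes; all names and shapes are assumptions.

```python
import torch

def topk_attention(q, k, v, topk):
    # Full score matrix, then keep only the topk largest scores per query;
    # everything below the k-th largest is masked to -inf before the softmax.
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    kth = scores.topk(topk, dim=-1).values[..., -1:]  # k-th largest per row
    scores = scores.masked_fill(scores < kth, float("-inf"))
    attn = torch.softmax(scores, dim=-1)
    return attn @ v

# Usage: each query attends to only its 4 highest-scoring keys.
q = torch.randn(2, 16, 64)
k = torch.randn(2, 16, 64)
v = torch.randn(2, 16, 64)
out = topk_attention(q, k, v, topk=4)
print(out.shape)  # torch.Size([2, 16, 64])
```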