ag1988 / top_k_attentionLinks

The accompanying code for "Memory-efficient Transformers via Top-k Attention" (Ankit Gupta, Guy Dar, Shaya Goodman, David Ciprut, Jonathan Berant. SustaiNLP 2021).

☆69

Alternatives and similar repositories for top_k_attention

Users that are interested in top_k_attention are comparing it to the libraries listed below

Sorting:

microsoft / Stochastic-Mixture-of-Experts
This package implements THOR: Transformer with Stochastic Experts.
☆65Updated 4 years ago
HKUNLP / efficient-attention
[EVA ICLR'23; LARA ICML'22] Efficient attention mechanisms via control variates, random features, and importance sampling
☆86Updated 2 years ago
berlino / gated_linear_attention
☆105Updated last year
OpenNLPLab / Transnormer
[EMNLP 2022] Official implementation of Transnormer in our EMNLP 2022 paper - The Devil in Linear Transformer
☆63Updated 2 years ago
kssteven418 / LTP
[KDD'22] Learned Token Pruning for Transformers
☆100Updated 2 years ago
teelinsan / parallel-decoding
Repository of the paper "Accelerating Transformer Inference for Translation via Parallel Decoding"
☆120Updated last year
CyndxAI / QKNorm
Code for the paper "Query-Key Normalization for Transformers"
☆49Updated 4 years ago
BaohaoLiao / mefts
[NeurIPS 2023] Make Your Pre-trained Model Reversible: From Parameter to Memory Efficient Fine-Tuning
☆31Updated 2 years ago
JeanKaddour / NoTrainNoGain
Revisiting Efficient Training Algorithms For Transformer-based Language Models (NeurIPS 2023)
☆80Updated 2 years ago
dguo98 / DiffPruning
Parameter Efficient Transfer Learning with Diff Pruning
☆74Updated 4 years ago
VITA-Group / Random-MoE-as-Dropout
[ICLR 2023] "Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers" by Tianlong Chen*, Zhenyu Zhang*, Ajay Jaiswal…
☆55Updated 2 years ago
sanagno / adaptively_sparse_attention
☆21Updated 2 years ago
pkuzengqi / Skyformer
Skyformer: Remodel Self-Attention with Gaussian Kernel and Nystr\"om Method (NeurIPS 2021)
☆63Updated 3 years ago
Noahs-ARK / RFA
☆33Updated 4 years ago
microsoft / SparseMixer
Sparse Backpropagation for Mixture-of-Expert Training
☆29Updated last year
ylsung / vl-merging
PyTorch codes for the paper "An Empirical Study of Multimodal Model Merging"
☆37Updated 2 years ago
cambridgeltl / autopeft
AutoPEFT: Automatic Configuration Search for Parameter-Efficient Fine-Tuning (Zhou et al.; TACL 2024)
☆47Updated last year
varunnair18 / FISH
Code for "Training Neural Networks with Fixed Sparse Masks" (NeurIPS 2021).
☆59Updated 3 years ago
kyegomez / Blockwise-Parallel-Transformer
32 times longer context window than vanilla Transformers and up to 4 times longer than memory efficient Transformers.
☆48Updated 2 years ago
rabeehk / compacter
☆130Updated 3 years ago
Hunter-DDM / stablemoe
Code for the ACL-2022 paper "StableMoE: Stable Routing Strategy for Mixture of Experts"
☆50Updated 3 years ago
XuezheMax / fairseq-apollo
FairSeq repo with Apollo optimizer
☆114Updated last year
VITA-Group / EarlyBERT
[ACL-IJCNLP 2021] "EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets" by Xiaohan Chen, Yu Cheng, Shuohang Wang, Zhe Gan, …
☆18Updated 3 years ago
RobertCsordas / linear_layer_as_attention
The official repository for our paper "The Dual Form of Neural Networks Revisited: Connecting Test Time Predictions to Training Patterns …
☆16Updated 4 months ago
ahennequ / pytorch-custom-mma
☆29Updated 3 years ago
microsoft / EfficientLongSequenceModeling
☆51Updated 2 years ago
jungokasai / T2R
☆14Updated 2 years ago
huggingface / block_movement_pruning
Block Sparse movement pruning
☆81Updated 4 years ago
renll / SeqBoat
[NeurIPS 2023] Sparse Modular Activation for Efficient Sequence Modeling
☆39Updated last year
jzbjyb / ReAtt
Retrieval as Attention
☆82Updated 2 years ago