lucidrains / routing-transformerLinks

Fully featured implementation of Routing Transformer

☆298

Alternatives and similar repositories for routing-transformer

Users that are interested in routing-transformer are comparing it to the libraries listed below

Sorting:

lucidrains / sinkhorn-transformer
Sinkhorn Transformer - Practical implementation of Sparse Sinkhorn Attention
☆269Updated 4 years ago
tatp22 / linformer-pytorch
My take on a practical implementation of Linformer for Pytorch.
☆421Updated 3 years ago
NVIDIA / transformer-ls
Official PyTorch Implementation of Long-Short Transformer (NeurIPS 2021).
☆228Updated 3 years ago
LiyuanLucasLiu / Transformer-Clinic
Understanding the Difficulty of Training Transformers
☆332Updated 3 years ago
lucidrains / linformer
Implementation of Linformer for Pytorch
☆302Updated last year
lucidrains / compressive-transformer-pytorch
Pytorch implementation of Compressive Transformers, from Deepmind
☆163Updated 4 years ago
sacmehta / delight
DeLighT: Very Deep and Light-Weight Transformers
☆468Updated 5 years ago
rishikksh20 / FNet-pytorch
Unofficial implementation of Google's FNet: Mixing Tokens with Fourier Transforms
☆260Updated 4 years ago
mlpen / Nystromformer
☆387Updated 2 years ago
lucidrains / feedback-transformer-pytorch
Implementation of Feedback Transformer in Pytorch
☆108Updated 4 years ago
cybertronai / pytorch-lamb
Implementation of https://arxiv.org/abs/1904.00962
☆377Updated 4 years ago
laiguokun / Funnel-Transformer
☆219Updated 5 years ago
epfml / collaborative-attention
Code for Multi-Head Attention: Collaborate Instead of Concatenate
☆152Updated 2 years ago
lucidrains / long-short-transformer
Implementation of Long-Short Transformer, combining local and global inductive biases for attention over long sequences, in Pytorch
☆120Updated 4 years ago
sIncerass / powernorm
[ICML 2020] code for "PowerNorm: Rethinking Batch Normalization in Transformers" https://arxiv.org/abs/2003.07845
☆120Updated 4 years ago
facebookresearch / transformer-sequential
Trains Transformer model variants. Data isn't shuffled between batches.
☆143Updated 3 years ago
deep-spin / entmax
The entmax mapping and its loss, a family of sparse softmax alternatives.
☆452Updated last year
teddykoker / image-gpt
PyTorch Implementation of OpenAI's Image GPT
☆260Updated 2 years ago
lucidrains / memformer
Implementation of Memformer, a Memory-augmented Transformer, in Pytorch
☆125Updated 5 years ago
rish-16 / aft-pytorch
Unofficial PyTorch implementation of Attention Free Transformer (AFT) layers by Apple Inc.
☆243Updated 3 years ago
lucidrains / h-transformer-1d
Implementation of H-Transformer-1D, Hierarchical Attention for Sequence Learning
☆166Updated last year
lucidrains / local-attention
An implementation of local windowed attention for language modeling
☆488Updated 4 months ago
lukemelas / do-you-even-need-attention
Is the attention layer even necessary? (https://arxiv.org/abs/2105.02723)
☆483Updated 4 years ago
layer6ai-labs / T-Fixup
Code for the ICML'20 paper "Improving Transformer Optimization Through Better Initialization"
☆89Updated 4 years ago
facebookresearch / mega
Sequence modeling with Mega.
☆301Updated 2 years ago
kzl / universal-computation
Official codebase for Pretrained Transformers as Universal Computation Engines.
☆247Updated 3 years ago
KrisKorrel / sparsemax-pytorch
Implementation of Sparsemax activation in Pytorch
☆166Updated 5 years ago
twistedcubic / attention-rank-collapse
[ICML 2021 Oral] We show pure attention suffers rank collapse, and how different mechanisms combat it.
☆168Updated 4 years ago
lucidrains / Mega-pytorch
Implementation of Mega, the Single-head Attention with Multi-headed EMA architecture that currently holds SOTA on Long Range Arena
☆207Updated 2 years ago
lucidrains / fast-transformer-pytorch
Implementation of Fast Transformer in Pytorch
☆177Updated 4 years ago