sacmehta / delightLinks

DeLighT: Very Deep and Light-Weight Transformers

☆468

Alternatives and similar repositories for delight

Users that are interested in delight are comparing it to the libraries listed below

Sorting:

mit-han-lab / lite-transformer
[ICLR 2020] Lite Transformer with Long-Short Range Attention
☆610Updated last year
tatp22 / linformer-pytorch
My take on a practical implementation of Linformer for Pytorch.
☆420Updated 3 years ago
lucidrains / routing-transformer
Fully featured implementation of Routing Transformer
☆296Updated 3 years ago
majumderb / rezero
Official PyTorch Repo for "ReZero is All You Need: Fast Convergence at Large Depth"
☆415Updated last year
cybertronai / pytorch-lamb
Implementation of https://arxiv.org/abs/1904.00962
☆377Updated 4 years ago
LiyuanLucasLiu / Transformer-Clinic
Understanding the Difficulty of Training Transformers
☆330Updated 3 years ago
lucidrains / sinkhorn-transformer
Sinkhorn Transformer - Practical implementation of Sparse Sinkhorn Attention
☆268Updated 4 years ago
laiguokun / Funnel-Transformer
☆219Updated 5 years ago
NVIDIA / transformer-ls
Official PyTorch Implementation of Long-Short Transformer (NeurIPS 2021).
☆228Updated 3 years ago
lucidrains / g-mlp-pytorch
Implementation of gMLP, an all-MLP replacement for Transformers, in Pytorch
☆430Updated 4 years ago
yitu-opensource / ConvBert
☆254Updated 3 years ago
guolinke / TUPE
Transformer with Untied Positional Encoding (TUPE). Code of paper "Rethinking Positional Encoding in Language Pre-training". Improve exis…
☆252Updated 3 years ago
facebookresearch / adaptive-span
Transformer training code for sequential tasks
☆610Updated 4 years ago
PhilJd / contiguous_pytorch_params
Accelerate training by storing parameters in one contiguous chunk of memory.
☆293Updated 4 years ago
lukemelas / do-you-even-need-attention
Is the attention layer even necessary? (https://arxiv.org/abs/2105.02723)
☆485Updated 4 years ago
rishikksh20 / FNet-pytorch
Unofficial implementation of Google's FNet: Mixing Tokens with Fourier Transforms
☆259Updated 4 years ago
XuezheMax / apollo
Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization
☆182Updated 3 years ago
microsoft / fastformers
FastFormers - highly efficient transformer models for NLU
☆707Updated 7 months ago
yangkky / distributed_tutorial
☆261Updated 6 years ago
JetRunner / BERT-of-Theseus
⛵️The official PyTorch implementation for "BERT-of-Theseus: Compressing BERT by Progressive Module Replacing" (EMNLP 2020).
☆315Updated 2 years ago
lena-voita / the-story-of-heads
This is a repository with the code for the ACL 2019 paper "Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, t…
☆314Updated 4 years ago
OpenNLPLab / cosFormer
[ICLR 2022] Official implementation of cosformer-attention in cosFormer: Rethinking Softmax in Attention
☆196Updated 2 years ago
lucidrains / performer-pytorch
An implementation of Performer, a linear attention-based transformer, in Pytorch
☆1,154Updated 3 years ago
graykode / ALBERT-Pytorch
Pytorch Implementation of ALBERT(A Lite BERT for Self-supervised Learning of Language Representations)
☆227Updated 4 years ago
epfml / collaborative-attention
Code for Multi-Head Attention: Collaborate Instead of Concatenate
☆151Updated 2 years ago
alphadl / lookahead.pytorch
lookahead optimizer (Lookahead Optimizer: k steps forward, 1 step back) for pytorch
☆337Updated 6 years ago
idiap / fast-transformers
Pytorch library for fast transformer implementations
☆1,744Updated 2 years ago
budzianowski / PyTorch-Beam-Search-Decoding
PyTorch implementation of beam search decoding for seq2seq models
☆339Updated 2 years ago
rish-16 / aft-pytorch
Unofficial PyTorch implementation of Attention Free Transformer (AFT) layers by Apple Inc.
☆243Updated 3 years ago
karanchahal / distiller
A large scale study of Knowledge Distillation.
☆220Updated 5 years ago