LiyuanLucasLiu / Transformer-Clinic

Understanding the Difficulty of Training Transformers

☆329

Alternatives and similar repositories for Transformer-Clinic

Users interested in Transformer-Clinic are comparing it to the repositories listed below.

laiguokun / Funnel-Transformer
☆218 Updated 5 years ago
guolinke / TUPE
Transformer with Untied Positional Encoding (TUPE). Code of paper "Rethinking Positional Encoding in Language Pre-training". Improve exis…
☆251 Updated 3 years ago
cybertronai / pytorch-lamb
Implementation of https://arxiv.org/abs/1904.00962
☆376 Updated 4 years ago
richarddwang / electra_pytorch
Pretrain and finetune ELECTRA with fastai and huggingface. (Results of the paper replicated!)
☆330 Updated last year
pmichel31415 / are-16-heads-really-better-than-1
Code for the paper "Are Sixteen Heads Really Better than One?"
☆172 Updated 5 years ago
layer6ai-labs / T-Fixup
Code for the ICML'20 paper "Improving Transformer Optimization Through Better Initialization"
☆89 Updated 4 years ago
facebookresearch / unlikelihood_training
Neural Text Generation with Unlikelihood Training
☆309 Updated 3 years ago
microsoft / infinibatch
Efficient, check-pointed data loading for deep learning with massive data sets.
☆208 Updated 2 years ago
IntelLabs / academic-budget-bert
Repository containing code for "How to Train BERT with an Academic Budget" paper
☆314 Updated last year
lucidrains / routing-transformer
Fully featured implementation of Routing Transformer
☆297 Updated 3 years ago
bloodwass / mixout
Implementation of Mixout with PyTorch
☆75 Updated 2 years ago
lena-voita / the-story-of-heads
This is a repository with the code for the ACL 2019 paper "Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, t…
☆314 Updated 4 years ago
graykode / ALBERT-Pytorch
PyTorch implementation of ALBERT (A Lite BERT for Self-supervised Learning of Language Representations)
☆226 Updated 4 years ago
facebookresearch / Mask-Predict
A masked language modeling objective to train a model to predict any subset of the target words, conditioned on both the input text and a…
☆244 Updated 3 years ago
tnq177 / transformers_without_tears
Transformers without Tears: Improving the Normalization of Self-Attention
☆132 Updated last year
lucidrains / sinkhorn-transformer
Sinkhorn Transformer - Practical implementation of Sparse Sinkhorn Attention
☆267 Updated 3 years ago
deep-spin / entmax
The entmax mapping and its loss, a family of sparse softmax alternatives.
☆443 Updated last year
asappresearch / revisit-bert-finetuning
For the code release of our arXiv paper "Revisiting Few-sample BERT Fine-tuning" (https://arxiv.org/abs/2006.05987).
☆184 Updated 2 years ago
ChunyuanLI / Optimus
Optimus: the first large-scale pre-trained VAE language model
☆390 Updated last year
lucidrains / electra-pytorch
A simple and working implementation of Electra, the fastest way to pretrain language models from scratch, in Pytorch
☆227 Updated 2 years ago
sacmehta / delight
DeLighT: Very Deep and Light-Weight Transformers
☆470 Updated 4 years ago
epfml / collaborative-attention
Code for Multi-Head Attention: Collaborate Instead of Concatenate
☆152 Updated 2 years ago
XuezheMax / fairseq-apollo
FairSeq repo with Apollo optimizer
☆114 Updated last year
XuezheMax / flowseq
Generative Flow based Sequence-to-Sequence Toolkit written in Python.
☆245 Updated 5 years ago
lucidrains / compressive-transformer-pytorch
Pytorch implementation of Compressive Transformers, from Deepmind
☆162 Updated 3 years ago
kahne / NonAutoregGenProgress
Tracking the progress in non-autoregressive generation (translation, transcription, etc.)
☆306 Updated 2 years ago
microsoft / fastseq
An efficient implementation of the popular sequence models for text generation, summarization, and translation tasks. https://arxiv.org/p…
☆433 Updated 2 years ago
facebookresearch / SentAugment
SentAugment is a data augmentation technique for NLP that retrieves similar sentences from a large bank of sentences. It can be used in c…
☆361 Updated 3 years ago
uds-lsv / bert-stable-fine-tuning
On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines
☆136 Updated last year
NVIDIA / transformer-ls
Official PyTorch Implementation of Long-Short Transformer (NeurIPS 2021).
☆225 Updated 3 years ago