ItzikMalkiel / MTAdam

MTAdam: Automatic Balancing of Multiple Training Loss Terms

☆36

Alternatives and similar repositories for MTAdam

Users who are interested in MTAdam are comparing it to the repositories listed below.

izmailovpavel / torch_swa_examples
☆47Updated 4 years ago
sIncerass / powernorm
[ICML 2020] code for "PowerNorm: Rethinking Batch Normalization in Transformers" https://arxiv.org/abs/2003.07845
☆120Updated 4 years ago
epfml / collaborative-attention
Code for Multi-Head Attention: Collaborate Instead of Concatenate
☆152Updated 2 years ago
tbachlechner / ReZero-examples
PyTorch Examples repo for "ReZero is All You Need: Fast Convergence at Large Depth"
☆62Updated last year
giannisdaras / smyrf
[NeurIPS 2020] Official Implementation: "SMYRF: Efficient Attention using Asymmetric Clustering".
☆50Updated 2 years ago
yaohungt / TransformerDissection
[EMNLP'19] Summary for Transformer Understanding
☆53Updated 6 years ago
ssnl / PyTorch-Reparam-Module
Reparameterize your PyTorch modules
☆71Updated 4 years ago
Holmeswww / PPOGAN
☆25Updated last year
j-min / Dropouts
PyTorch Implementations of Dropout Variants
☆88Updated 7 years ago
ischlag / fast-weight-transformers
Official code repository of the paper Linear Transformers Are Secretly Fast Weight Programmers.
☆110Updated 4 years ago
JingzhaoZhang / why-clipping-accelerates
A pytorch implementation for the LSTM experiments in the paper: Why Gradient Clipping Accelerates Training: A Theoretical Justification f…
☆46Updated 5 years ago
dbaranchuk / memory-efficient-maml
Memory efficient MAML using gradient checkpointing
☆86Updated 5 years ago
takashiishida / flooding
[ICML 2020] code for the flooding regularizer proposed in "Do We Need Zero Training Loss After Achieving Zero Training Error?"
☆95Updated 2 years ago
lucidrains / axial-positional-embedding
Axial Positional Embedding for Pytorch
☆84Updated 9 months ago
lucidrains / hamburger-pytorch
Pytorch implementation of the hamburger module from the ICLR 2021 paper "Is Attention Better Than Matrix Decomposition"
☆99Updated 4 years ago
yedidh / glann
Official code for paper "Non-Adversarial Image Synthesis with Generative Latent Nearest Neighbors"
☆28Updated 6 years ago
david-abel / neurips_2019
Notes from NeurIPS 2019
☆29Updated 5 years ago
msobroza / SparsemaxPytorch
SparseMax activation function implementation (ICML 2016) (PyTorch)
☆28Updated 8 years ago
salesforce / NeuralBayes
☆24Updated 7 months ago
Zhiyuan1991 / proVLAE
☆27Updated 5 years ago
lucidrains / distilled-retriever-pytorch
Implementation of the retriever distillation procedure as outlined in the paper "Distilling Knowledge from Reader to Retriever"
☆32Updated 4 years ago
yaohungt / Adaptive-Regularization-Neural-Network
[NeurIPS'19] [PyTorch] Adaptive Regularization in NN
☆68Updated 6 years ago
lancopku / AdaNorm
Code for "Understanding and Improving Layer Normalization"
☆46Updated 5 years ago
lucidrains / isab-pytorch
An implementation of (Induced) Set Attention Block, from the Set Transformers paper
☆65Updated 2 years ago
10-zin / Synthesizer
A PyTorch implementation of the paper - "Synthesizer: Rethinking Self-Attention in Transformer Models"
☆73Updated 2 years ago
matthewmackay / reversible-rnn
Code for reversible recurrent neural networks
☆40Updated 6 years ago
layer6ai-labs / T-Fixup
Code for the ICML'20 paper "Improving Transformer Optimization Through Better Initialization"
☆89Updated 4 years ago
RedRyan111 / GLOM
An implementation of 2021 paper by Geoffrey Hinton: "How to represent part-whole hierarchies in a neural network" in Pytorch.
☆57Updated 4 years ago
lucidrains / long-short-transformer
Implementation of Long-Short Transformer, combining local and global inductive biases for attention over long sequences, in Pytorch
☆120Updated 4 years ago
Separius / CudaRelativeAttention
custom cuda kernel for {2, 3}d relative attention with pytorch wrapper
☆43Updated 5 years ago