JeanKaddour / NoTrainNoGain

Revisiting Efficient Training Algorithms For Transformer-based Language Models (NeurIPS 2023)

☆80

Alternatives and similar repositories for NoTrainNoGain

Users who are interested in NoTrainNoGain are comparing it to the repositories listed below.

abhishekpanigrahi1996 / transformer_in_transformer
☆45 Updated last year
JonasGeiping / linear_cross_entropy_loss
A fusion of a linear layer and a cross entropy loss, written for PyTorch in Triton.
☆70 Updated last year
epfml / schedules-and-scaling
Code for NeurIPS 2024 Spotlight: "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations"
☆77 Updated 9 months ago
mlfoundations / scaling
Language models scale reliably with over-training and on downstream tasks
☆97 Updated last year
berlino / seq_icl
☆53 Updated last year
MadryLab / DsDm
☆50 Updated last year
JeanKaddour / LAWA
Latest Weight Averaging (NeurIPS HITY 2022)
☆31 Updated 2 years ago
varunnair18 / FISH
Code for "Training Neural Networks with Fixed Sparse Masks" (NeurIPS 2021).
☆59 Updated 3 years ago
McGill-NLP / length-generalization
Code for the paper "The Impact of Positional Encoding on Length Generalization in Transformers", NeurIPS 2023
☆136 Updated last year
VITA-Group / Random-MoE-as-Dropout
[ICLR 2023] "Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers" by Tianlong Chen*, Zhenyu Zhang*, Ajay Jaiswal…
☆53 Updated 2 years ago
tml-epfl / why-weight-decay
Why Do We Need Weight Decay in Modern Deep Learning? [NeurIPS 2024]
☆66 Updated 10 months ago
sustcsonglin / mamba-triton
☆49 Updated last year
VITA-Group / Junk_DNA_Hypothesis
[ICML 2024] Junk DNA Hypothesis: A Task-Centric Angle of LLM Pre-trained Weights through Sparsity; Lu Yin*, Ajay Jaiswal*, Shiwei Liu, So…
☆16 Updated 3 months ago
sjelassi / transformers_ssm_copy
☆33 Updated last year
mmatena / model_merging
☆71 Updated 3 years ago
insuhan / hyper-attn
☆81 Updated last year
gregorbachmann / Next-Token-Failures
☆88 Updated last year
shreyansh26 / Attention-Mask-Patterns
Using FlexAttention to compute attention with different masking patterns
☆44 Updated 10 months ago
microsoft / SparseMixer
Sparse Backpropagation for Mixture-of-Expert Training
☆30 Updated last year
berlino / gated_linear_attention
☆106 Updated last year
formll / resolving-scaling-law-discrepancies
☆20 Updated last year
princeton-nlp / LM-Kernel-FT
A Kernel-Based View of Language Model Fine-Tuning https://arxiv.org/abs/2210.05643
☆78 Updated last year
Edward-Sun / gpt-accelera
Simple and efficient pytorch-native transformer training and inference (batched)
☆78 Updated last year
GSYfate / knnlm-limits
Official code repo for paper "Great Memory, Shallow Reasoning: Limits of kNN-LMs"
☆23 Updated 3 months ago
hadasah / btm
☆75 Updated last year
smonsays / hypernetwork-attention
Official code for the paper "Attention as a Hypernetwork"
☆40 Updated last year
microsoft / Stochastic-Mixture-of-Experts
This package implements THOR: Transformer with Stochastic Experts.
☆65 Updated 3 years ago
VITA-Group / SMC-Bench
[ICLR 2023] "Sparsity May Cry: Let Us Fail (Current) Sparse Neural Networks Together!" Shiwei Liu, Tianlong Chen, Zhenyu Zhang, Xuxi Chen…
☆28 Updated last year
r-three / realistic_evaluation_of_model_merging_for_compositional_generalization
☆12 Updated 9 months ago
HazyResearch / prefix-linear-attention
☆56 Updated last year