sail-sg / win

☆9

Alternatives and similar repositories for win

Users who are interested in win are comparing it to the libraries listed below.

zqOuO / GWT
☆13 Updated 6 months ago
TianjinYellow / SPAM-Optimizer
☆34 Updated 4 months ago
lzhangbv / eva
[ICLR 2023] Eva: Practical Second-order Optimization with Kronecker-vectorized Approximation
☆12 Updated 2 years ago
LIONS-EPFL / scion
☆33 Updated 3 weeks ago
MarlonBecker / MSAM
☆19 Updated last year
AngusDujw / SAF
☆36 Updated 2 years ago
zyushun / hessian-spectrum
Code for the paper: Why Transformers Need Adam: A Hessian Perspective
☆60 Updated 4 months ago
ethansmith2000 / fsdp_optimizers
Supporting PyTorch FSDP for optimizers
☆84 Updated 8 months ago
fangyuan-ksgk / selective-attention-transformer
Unofficial Implementation of Selective Attention Transformer
☆17 Updated 9 months ago
tml-epfl / why-weight-decay
Why Do We Need Weight Decay in Modern Deep Learning? [NeurIPS 2024]
☆66 Updated 10 months ago
dayal-kalra / low-memory-adam
☆11 Updated 5 months ago
zhixuan-lin / forgetting-transformer
[ICLR 2025] Official PyTorch implementation of "Forgetting Transformer: Softmax Attention with a Forget Gate"
☆118 Updated last month
wjxts / RegularizedBN
☆21 Updated 2 years ago
cloneofsimo / ezmup
Simple implementation of muP, based on Spectral Condition for Feature Learning. The implementation is SGD only; don't use it for Adam
☆84 Updated last year
Adamdad / rational_kat_cu
☆70 Updated 6 months ago
krafton-ai / mambaformer-icl
MambaFormer in-context learning experiments and implementation for https://arxiv.org/abs/2402.04248
☆55 Updated last year
nikhilvyas / SOAP
☆206 Updated 8 months ago
JeanKaddour / NoTrainNoGain
Revisiting Efficient Training Algorithms For Transformer-based Language Models (NeurIPS 2023)
☆80 Updated last year
proger / hippogriff
Griffin MQA + Hawk Linear RNN Hybrid
☆88 Updated last year
lucidrains / infini-transformer-pytorch
Implementation of Infini-Transformer in Pytorch
☆110 Updated 7 months ago
shikaiqiu / compute-better-spent
☆53 Updated 10 months ago
deep-spin / adasplash
AdaSplash: Adaptive Sparse Flash Attention (aka Flash Entmax Attention)
☆19 Updated 3 weeks ago
Leiay / looped_transformer
☆31 Updated last year
fkodom / soft-mixture-of-experts
PyTorch implementation of Soft MoE by Google Brain in "From Sparse to Soft Mixtures of Experts" (https://arxiv.org/pdf/2308.00951.pdf)
☆76 Updated last year
nblt / F-SAM
[CVPR 2024] Friendly Sharpness-Aware Minimization
☆34 Updated 9 months ago
berlino / gated_linear_attention
☆106 Updated last year
andyjm3 / SLTrain
SLTrain: a sparse plus low-rank approach for parameter and memory efficient pretraining (NeurIPS 2024)
☆32 Updated 9 months ago
lmsdss / LayerNorm-Scaling
Official Pytorch Implementation of "The Curse of Depth in Large Language Models" by Wenfang Sun, Xinyuan Song, Pengxiang Li, Lu Yin, Yefen…
☆55 Updated 2 weeks ago
sustcsonglin / flash-linear-rnn
Implementations of various linear RNN layers using pytorch and triton
☆53 Updated 2 years ago
ambisinister / mla-experiments
Experiments on Multi-Head Latent Attention
☆94 Updated 11 months ago