TianjinYellow / SPAM-OptimizerLinks

☆34

Alternatives and similar repositories for SPAM-Optimizer

Users that are interested in SPAM-Optimizer are comparing it to the libraries listed below

Sorting:

zhixuan-lin / forgetting-transformer
[ICLR 2025] Official PyTorch implementation of "Forgetting Transformer: Softmax Attention with a Forget Gate"
☆118Updated last month
zqOuO / GWT
☆13Updated 6 months ago
RobertCsordas / moeut
☆83Updated 11 months ago
kuleshov-group / remdm
Remasking Discrete Diffusion Models with Inference-Time Scaling
☆36Updated 4 months ago
IST-DASLab / QuEST
Work in progress.
☆70Updated last month
Infini-AI-Lab / Multiverse
☆81Updated last week
pixeli99 / MixLN
[ICLR 2025] Official Pytorch Implementation of "Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN" by Pengxia…
☆25Updated last week
PiotrNawrot / sparse-frontier
The evaluation framework for training-free sparse attention in LLMs
☆86Updated last month
abdelfattah-lab / TokenButler
☆23Updated last week
lmsdss / LayerNorm-Scaling
Official Pytorch Implementation of "The Curse of Depth in Large Language Models" by Wenfang Sun, Xinyuan Song, Pengxiang Li, Lu Yin,Yefen…
☆55Updated last week
VITA-Group / WeLore
From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients. Ajay Jaiswal, Lu Yin, Zhenyu Zhang, Shiwei Liu,…
☆47Updated 3 months ago
lucidrains / PEER-pytorch
Pytorch implementation of the PEER block from the paper, Mixture of A Million Experts, by Xu Owen He at Deepmind
☆127Updated 11 months ago
deep-spin / adasplash
AdaSplash: Adaptive Sparse Flash Attention (aka Flash Entmax Attention)
☆19Updated 3 weeks ago
Cranial-XIX / longhorn
Official PyTorch Implementation of the Longhorn Deep State Space Model
☆54Updated 8 months ago
fangyuan-ksgk / selective-attention-transformer
Unofficial Implementation of Selective Attention Transformer
☆17Updated 9 months ago
RWKV / RWKV-LM
RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best…
☆51Updated 4 months ago
berlino / gated_linear_attention
☆106Updated last year
kyleliang919 / Online-Subspace-Descent
[NeurIPS 2024] Low rank memory efficient optimizer without SVD
☆30Updated last month
horseee / dKV-Cache
☆88Updated 2 months ago
htqin / IR-QLoRA
[ICML 2024 Oral] This project is the official implementation of our Accurate LoRA-Finetuning Quantization of LLMs via Information Retenti…
☆67Updated last year
Anonymous1252022 / Megatron-DeepSpeed
☆12Updated 10 months ago
chuanyang-Zheng / DAPE
The this is the official implementation of "DAPE: Data-Adaptive Positional Encoding for Length Extrapolation"
☆38Updated 9 months ago
jzhang38 / LongMamba
Some preliminary explorations of Mamba's context scaling.
☆216Updated last year
assafbk / DeciMamba
DeciMamba: Exploring the Length Extrapolation Potential of Mamba (ICLR 2025)
☆28Updated 3 months ago
ldery / Bonsai
Code for "Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes"
☆28Updated last year
GATECH-EIC / Linearized-LLM
[ICML 2024] When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models
☆33Updated last year
sustcsonglin / linear-attention-and-beyond-slides
☆79Updated 5 months ago
NVlabs / GatedDeltaNet
[ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule
☆193Updated 4 months ago
wang-kee / LiNeS
Official repository of "LiNeS: Post-training Layer Scaling Prevents Forgetting and Enhances Model Merging"
☆30Updated 9 months ago
andyjm3 / SLTrain
SLTrain: a sparse plus low-rank approach for parameter and memory efficient pretraining (NeurIPS 2024)
☆32Updated 9 months ago