TianjinYellow / SPAM-Optimizer
☆25 Updated last month
Alternatives and similar repositories for SPAM-Optimizer:
Users who are interested in SPAM-Optimizer are comparing it to the libraries listed below:
- DeciMamba: Exploring the Length Extrapolation Potential of Mamba (ICLR 2025) ☆23 Updated 7 months ago
- Official PyTorch Implementation for Paper "No More Adam: Learning Rate Scaling at Initialization is All You Need" ☆49 Updated last month
- [ICML 2024] When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models ☆29 Updated 8 months ago
- [ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule ☆137 Updated last week
- This is the official implementation of "DAPE: Data-Adaptive Positional Encoding for Length Extrapolation" ☆36 Updated 4 months ago
- Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention" ☆96 Updated 5 months ago
- ☆27 Updated last month
- Code for "Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes" ☆27 Updated 11 months ago
- Official PyTorch Implementation of the Longhorn Deep State Space Model ☆48 Updated 3 months ago
- [ICLR2025] DiffuGPT and DiffuLLaMA: Scaling Diffusion Language Models via Adaptation from Autoregressive Models ☆102 Updated last week
- ☆73 Updated 6 months ago
- This repo is based on https://github.com/jiaweizzhao/GaLore ☆25 Updated 5 months ago
- Pytorch implementation of the PEER block from the paper, Mixture of A Million Experts, by Xu Owen He at Deepmind ☆119 Updated 6 months ago
- When it comes to optimizers, it's always better to be safe than sorry ☆180 Updated last week
- Implementation of the proposed MaskBit from Bytedance AI ☆75 Updated 3 months ago
- The codebase of our paper "Improving the Training of Rectified Flows", NeurIPS 2024 ☆97 Updated 4 months ago
- Minimal Implementation of Visual Autoregressive Modelling (VAR) ☆28 Updated last month
- Stick-breaking attention ☆44 Updated last month
- Implementation of Infini-Transformer in Pytorch ☆109 Updated 2 months ago
- Implementation of a multimodal diffusion transformer in Pytorch ☆100 Updated 8 months ago
- [ICLR2025] Codebase for "ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing", built on Megatron-LM. ☆59 Updated 2 months ago
- Triton implementation of bi-directional (non-causal) linear attention ☆42 Updated last month
- 🔥 A minimal training framework for scaling FLA models ☆73 Updated this week
- Official Pytorch Implementation of "Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN" by Pengxiang Li, Lu Yi… ☆17 Updated 2 months ago
- DPO, but faster 🚀 ☆40 Updated 2 months ago
- ☆13 Updated 3 months ago
- Official code for the paper "Attention as a Hypernetwork" ☆24 Updated 8 months ago
- ☆100 Updated 11 months ago