TianjinYellow / SPAM-Optimizer
☆34 · Updated 7 months ago
Alternatives and similar repositories for SPAM-Optimizer
Users interested in SPAM-Optimizer are comparing it to the libraries listed below.
- [ICLR 2025 & COLM 2025] Official PyTorch implementation of the Forgetting Transformer and Adaptive Computation Pruning ☆131 · Updated last month
- ☆13 · Updated 9 months ago
- ☆86 · Updated last year
- [NeurIPS '25] Multi-Token Prediction Needs Registers ☆22 · Updated last month
- AdaSplash: Adaptive Sparse Flash Attention (aka Flash Entmax Attention) ☆26 · Updated 3 weeks ago
- Remasking Discrete Diffusion Models with Inference-Time Scaling ☆49 · Updated 7 months ago
- Official PyTorch Implementation for Vision-Language Models Create Cross-Modal Task Representations, ICML 2025 ☆31 · Updated 5 months ago
- The official implementation of "DAPE: Data-Adaptive Positional Encoding for Length Extrapolation" ☆39 · Updated last year
- PyTorch implementation of the PEER block from the paper "Mixture of A Million Experts" by Xu Owen He at DeepMind ☆129 · Updated last year
- Code for "Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes" ☆28 · Updated last year
- The official GitHub repo for "Diffusion Language Models are Super Data Learners" ☆135 · Updated 3 weeks ago
- [ICML'24 Oral] The official code of "DiJiang: Efficient Large Language Models through Compact Kernelization", a novel DCT-based linear at… ☆104 · Updated last year
- An unofficial implementation of "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆36 · Updated last year
- Official PyTorch implementation of "The Curse of Depth in Large Language Models" by Wenfang Sun, Xinyuan Song, Pengxiang Li, Lu Yin, Yefen… ☆60 · Updated last week
- Kinetics: Rethinking Test-Time Scaling Laws ☆81 · Updated 3 months ago
- Official PyTorch implementation and models for the paper "Diffusion Beats Autoregressive in Data-Constrained Settings". We find diffusion mod… ☆101 · Updated 2 months ago
- From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients. Ajay Jaiswal, Lu Yin, Zhenyu Zhang, Shiwei Liu,… ☆51 · Updated 6 months ago
- Official implementation of Phi-Mamba, a MOHAWK-distilled model (Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Mode… ☆116 · Updated last year
- Unofficial implementation of the Selective Attention Transformer ☆17 · Updated 11 months ago
- Implementation of the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆108 · Updated last week
- Flash-Muon: An Efficient Implementation of the Muon Optimizer ☆197 · Updated 4 months ago
- DeciMamba: Exploring the Length Extrapolation Potential of Mamba (ICLR 2025) ☆31 · Updated 6 months ago
- Official code for the paper "Attention as a Hypernetwork" ☆44 · Updated last year
- Code for the NeurIPS 2024 Spotlight "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations" ☆84 · Updated 11 months ago
- An efficient implementation of the NSA (Native Sparse Attention) kernel ☆121 · Updated 4 months ago
- M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models ☆44 · Updated 3 months ago
- [ICML 2024] When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models ☆36 · Updated last year
- Work in progress. ☆74 · Updated 3 months ago
- ☆98 · Updated last month
- Here we will test various linear attention designs. ☆61 · Updated last year