The AdEMAMix Optimizer: Better, Faster, Older.
☆186 Sep 12, 2024 Updated last year
Alternatives and similar repositories for AdEMAMix-Optimizer-Pytorch
Users who are interested in AdEMAMix-Optimizer-Pytorch are comparing it to the libraries listed below.
- ☆307 Apr 23, 2025 Updated 10 months ago
- ☆70 Nov 15, 2024 Updated last year
- Code for the paper "Function-Space Learning Rates" ☆25 Jun 3, 2025 Updated 9 months ago
- ☆19 Jan 10, 2025 Updated last year
- GoldFinch and other hybrid transformer components ☆12 Dec 9, 2025 Updated 2 months ago
- ☆35 Mar 12, 2025 Updated 11 months ago
- Grams: Gradient Descent with Adaptive Momentum Scaling (ICLR 2025 Workshop) ☆17 Mar 6, 2025 Updated 11 months ago
- Code for Adam-mini: Use Fewer Learning Rates To Gain More https://arxiv.org/abs/2406.16793 ☆453 May 13, 2025 Updated 9 months ago
- Accelerated First Order Parallel Associative Scan ☆195 Jan 7, 2026 Updated last month
- ☆40 Jan 5, 2024 Updated 2 years ago
- ☆252 Dec 2, 2024 Updated last year
- Schedule free optimiser implemented in JAX using Optimistix ☆15 May 29, 2024 Updated last year
- Official repository for the paper "Grokfast: Accelerated Grokking by Amplifying Slow Gradients" ☆577 Jun 28, 2024 Updated last year
- FlexAttention w/ FlashAttention3 Support ☆27 Oct 5, 2024 Updated last year
- GoldFinch and other hybrid transformer components ☆45 Jul 20, 2024 Updated last year
- A library for unit scaling in PyTorch ☆133 Jul 11, 2025 Updated 7 months ago
- ☆138 Aug 19, 2024 Updated last year
- Efficient optimizers ☆285 Dec 20, 2025 Updated 2 months ago
- Official Implementation of "ADOPT: Modified Adam Can Converge with Any β2 with the Optimal Rate" ☆435 Dec 12, 2024 Updated last year
- Official PyTorch Implementation for Paper "No More Adam: Learning Rate Scaling at Initialization is All You Need" ☆56 Jan 27, 2025 Updated last year
- The open-source materials for paper "Sparsing Law: Towards Large Language Models with Greater Activation Sparsity". ☆30 Nov 12, 2024 Updated last year
- [ICML 2024] Official Repository for the paper "Transformers Get Stable: An End-to-End Signal Propagation Theory for Language Models" ☆10 Jul 19, 2024 Updated last year
- Scalable and Stable Parallelization of Nonlinear RNNS ☆29 Oct 21, 2025 Updated 4 months ago
- Stick-breaking attention ☆62 Jul 1, 2025 Updated 8 months ago
- Code for the paper "Toward Fully Self-Supervised Multi-Pitch Estimation". ☆23 Sep 27, 2025 Updated 5 months ago
- Schedule-Free Optimization in PyTorch ☆2,257 May 21, 2025 Updated 9 months ago
- Official implementation of "GPT or BERT: why not both?" ☆62 Jul 28, 2025 Updated 7 months ago
- ☆14 Mar 20, 2025 Updated 11 months ago
- ☆14 Apr 14, 2025 Updated 10 months ago
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆133 Dec 3, 2024 Updated last year
- [NeurIPS 2025] Official Pytorch Implementation of "The Curse of Depth in Large Language Models" by Wenfang Sun, Xinyuan Song, Pengxiang L… ☆67 Jan 2, 2026 Updated 2 months ago
- Muon is an optimizer for hidden layers in neural networks ☆2,329 Jan 19, 2026 Updated last month
- All information and news with respect to Falcon-H1 series ☆108 Oct 9, 2025 Updated 4 months ago
- ☆28 Sep 5, 2024 Updated last year
- Pytorch implementation of the invertible CQT based on Non-stationary Gabor filters ☆36 Jun 20, 2023 Updated 2 years ago
- JAX Scalify: end-to-end scaled arithmetics ☆18 Oct 30, 2024 Updated last year
- See https://github.com/cuda-mode/triton-index/ instead! ☆11 May 8, 2024 Updated last year
- Tools to isolate speaker and transcribe unstructured audio clips ☆11 Dec 4, 2022 Updated 3 years ago
- ☆13 Apr 7, 2024 Updated last year