zyushun / Adam-mini
Code for Adam-mini: Use Fewer Learning Rates To Gain More (https://arxiv.org/abs/2406.16793)
⭐ 407 · Updated last week
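The tagline refers to shrinking Adam's optimizer state: instead of keeping one second-moment entry per coordinate, each parameter block shares a single adaptive learning rate. Below is a minimal sketch of that idea only, not the repository's code: it treats every parameter tensor as one block, omits weight decay and the paper's block-partitioning rules, and `adam_mini_step` plus its state layout are illustrative names rather than the project's API.

```python
# Minimal sketch of the "fewer learning rates" idea, not the official Adam-mini code.
import torch

@torch.no_grad()
def adam_mini_step(params, states, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One optimizer step; each parameter tensor is treated as a single block."""
    for p in params:
        if p.grad is None:
            continue
        s = states.setdefault(p, {"m": torch.zeros_like(p), "v": 0.0, "t": 0})
        s["t"] += 1
        g = p.grad
        # First moment: per-coordinate, exactly as in Adam.
        s["m"].mul_(beta1).add_(g, alpha=1 - beta1)
        # Second moment: a single scalar per block, tracking the mean squared
        # gradient over the whole block instead of one value per coordinate.
        s["v"] = beta2 * s["v"] + (1 - beta2) * g.pow(2).mean().item()
        m_hat = s["m"] / (1 - beta1 ** s["t"])
        v_hat = s["v"] / (1 - beta2 ** s["t"])
        # One shared adaptive learning rate for every coordinate in the block.
        p.add_(m_hat, alpha=-lr / (v_hat ** 0.5 + eps))
```

In the paper the blocks are chosen to follow the model's structure rather than simply per tensor, which is where most of the state reduction relative to Adam comes from.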
Alternatives and similar repositories for Adam-mini:
Users interested in Adam-mini are comparing it to the libraries listed below.
- Muon optimizer: >30% sample efficiency with <3% wallclock overhead ⭐ 577 · Updated last month
- When it comes to optimizers, it's always better to be safe than sorry ⭐ 220 · Updated 3 weeks ago
- Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ⭐ 240 · Updated last week
- Implementation of the sparse attention pattern proposed by the DeepSeek team in their "Native Sparse Attention" paper ⭐ 601 · Updated last month
- Quick implementation of nGPT, learning entirely on the hypersphere, from NvidiaAI ⭐ 280 · Updated last month
- [ICLR 2025 Spotlight] Official implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters ⭐ 551 · Updated 2 months ago
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients ⭐ 198 · Updated 9 months ago
- Helpful tools and examples for working with flex-attention ⭐ 726 · Updated 2 weeks ago
- ⭐ 272 · Updated this week
- Implementation of Ring Attention, from Liu et al. at Berkeley AI, in Pytorch ⭐ 511 · Updated 6 months ago
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, spars… (a minimal lookup sketch follows this list) ⭐ 320 · Updated 4 months ago
- Implementation of DoRA ⭐ 294 · Updated 10 months ago
- Implementation of ST-MoE, the latest incarnation of MoE after years of research at Brain, in Pytorch ⭐ 328 · Updated 10 months ago
- Some preliminary explorations of Mamba's context scaling ⭐ 212 · Updated last year
- Normalized Transformer (nGPT) ⭐ 171 · Updated 5 months ago
- Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" ⭐ 636 · Updated last month
- ⭐ 219 · Updated 10 months ago
- ⭐ 185 · Updated last week
- [ICML 2024 (Oral)] Official PyTorch implementation of DoRA: Weight-Decomposed Low-Rank Adaptation ⭐ 770 · Updated 6 months ago
- Ring attention implementation with flash attention ⭐ 743 · Updated 2 weeks ago
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024 ⭐ 289 · Updated last week
- ⭐ 419 · Updated this week
- ⭐ 217 · Updated 10 months ago
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" ⭐ 231 · Updated 2 months ago
- [ICML 2024] CLLMs: Consistency Large Language Models ⭐ 390 · Updated 5 months ago
- A project to improve the skills of large language models ⭐ 295 · Updated this week
- PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models (NeurIPS 2024 Spotlight) ⭐ 347 · Updated 2 months ago
- TransMLA: Multi-Head Latent Attention Is All You Need ⭐ 238 · Updated last month
- [NeurIPS 2024] Official Repository of The Mamba in the Llama: Distilling and Accelerating Hybrid Models ⭐ 214 · Updated last week
- The official implementation of Tensor ProducT ATTenTion Transformer (T6) ⭐ 361 · Updated last week
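For the memory-layers entry above, here is a minimal sketch of a trainable key-value lookup. It is illustrative only, not the linked repository's code: it assumes a plain dense key table with a top-k softmax readout, `KeyValueMemory`, `num_slots`, and `topk` are made-up names, and real memory layers rely on product-key factorizations so that even the key scoring stays cheap as the table grows.

```python
# Illustrative key-value memory layer (not the linked repository's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeyValueMemory(nn.Module):
    """Adds 2 * num_slots * dim trainable parameters; each token reads only
    from its top-k matching slots, so the value aggregation stays sparse."""

    def __init__(self, dim: int, num_slots: int = 4096, topk: int = 8):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_slots, dim) * dim ** -0.5)
        self.values = nn.Parameter(torch.randn(num_slots, dim) * dim ** -0.5)
        self.topk = topk

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim) token representations.
        scores = x @ self.keys.t()                        # (batch, num_slots)
        top_scores, top_idx = scores.topk(self.topk, -1)  # (batch, topk)
        weights = F.softmax(top_scores, dim=-1)
        picked = self.values[top_idx]                     # (batch, topk, dim)
        return (weights.unsqueeze(-1) * picked).sum(dim=1)
```

Usage would look like `y = KeyValueMemory(dim=512)(torch.randn(4, 512))`; in practice such a lookup replaces or augments a feed-forward block, and the naive dense scoring shown here is what product-key schemes avoid.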