zyushun / Adam-mini
Code for Adam-mini: Use Fewer Learning Rates To Gain More (https://arxiv.org/abs/2406.16793)
388 stars · Updated 2 months ago
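For orientation before the list of alternatives: Adam-mini is intended as a drop-in replacement for AdamW that assigns one learning rate per parameter block rather than per coordinate, which cuts optimizer-state memory. Below is a minimal usage sketch; the import path `adam_mini.Adam_mini` and the constructor keywords (`named_parameters`, `dim`, `n_heads`) are assumptions based on the repo's README and should be verified against the current release.

```python
# Minimal sketch of plugging Adam-mini into a standard PyTorch training loop.
# Assumptions (verify against the repo): `pip install adam-mini` provides
# `adam_mini.Adam_mini`, and the constructor takes the model's named parameters
# plus hidden-size / head-count hints so it can partition parameters into blocks.
import torch
import torch.nn as nn
from adam_mini import Adam_mini  # assumed import path

model = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)

optimizer = Adam_mini(
    named_parameters=model.named_parameters(),  # parameter blocks are built from these
    lr=1e-3,
    betas=(0.9, 0.95),
    weight_decay=0.1,
    dim=512,     # model hidden size (assumed keyword)
    n_heads=8,   # attention head count (assumed keyword)
)

# The training step itself is unchanged from AdamW.
x = torch.randn(4, 16, 512)          # (batch, seq, d_model) dummy batch
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```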
Alternatives and similar repositories for Adam-mini:
Users interested in Adam-mini are comparing it to the libraries listed below.
- Muon optimizer: +>30% sample efficiency with <3% wallclock overhead (434 stars, updated this week)
- When it comes to optimizers, it's always better to be safe than sorry (180 stars, updated last week)
- Implementation of Ring Attention, from Liu et al. at Berkeley AI, in Pytorch (503 stars, updated 4 months ago)
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients (195 stars, updated 7 months ago)
- (256 stars, updated last week)
- [ICML2024 (Oral)] Official PyTorch implementation of DoRA: Weight-Decomposed Low-Rank Adaptation (726 stars, updated 5 months ago)
- Implementation of DoRA (290 stars, updated 8 months ago)
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, spars… (301 stars, updated 2 months ago)
- (212 stars, updated 8 months ago)
- PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models (NeurIPS 2024 Spotlight) (327 stars, updated last month)
- Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… (225 stars, updated this week)
- Some preliminary explorations of Mamba's context scaling (213 stars, updated last year)
- PyTorch implementation of Infini-Transformer from "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention… (286 stars, updated 9 months ago)
- Helpful tools and examples for working with flex-attention (662 stars, updated last week)
- The official implementation of Tensor ProducT ATTenTion Transformer (T6) (311 stars, updated last week)
- Quick implementation of nGPT, learning entirely on the hypersphere, from NvidiaAI (273 stars, updated 3 months ago)
- Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff" (222 stars, updated 2 weeks ago)
- Normalized Transformer (nGPT) (156 stars, updated 3 months ago)
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" (218 stars, updated last month)
- (217 stars, updated 8 months ago)
- (142 stars, updated 2 weeks ago)
- [ICLR2025 Spotlight] Official Implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters (528 stars, updated 2 weeks ago)
- (181 stars, updated this week)
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton (515 stars, updated last week)
- Official PyTorch implementation of QA-LoRA (127 stars, updated 11 months ago)
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024) (149 stars, updated 2 months ago)
- Efficient LLM Inference over Long Sequences (362 stars, updated 2 weeks ago)
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024 (274 stars, updated last week)