lucidrains / nGPT-pytorch
Quick implementation of nGPT, learning entirely on the hypersphere, from Nvidia AI
☆270 · Updated 2 months ago
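nGPT's core idea is to constrain embeddings, hidden states, and weight rows to unit norm, so every representation lives on the hypersphere and the residual stream becomes a normalized interpolation rather than a plain addition. Below is a minimal sketch of that update, not this repository's actual API; `ngpt_residual_update` and the `alpha` step size are illustrative names and defaults.

```python
import torch
import torch.nn.functional as F

def l2norm(t, dim=-1):
    # project a tensor onto the unit hypersphere along `dim`
    return F.normalize(t, p=2, dim=dim)

def ngpt_residual_update(h, block_out, alpha=0.1):
    # hedged sketch: instead of h = h + block_out, take a small step from h
    # toward the (normalized) block output, then renormalize so the hidden
    # state stays on the hypersphere
    h, block_out = l2norm(h), l2norm(block_out)
    return l2norm(h + alpha * (block_out - h))

# usage: hidden states of shape (batch, seq, dim)
h = torch.randn(2, 16, 64)
update = torch.randn(2, 16, 64)  # stand-in for an attention/MLP block output
h = ngpt_residual_update(h, update)
```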
Alternatives and similar repositories for nGPT-pytorch:
Users interested in nGPT-pytorch are comparing it to the libraries listed below.
- Normalized Transformer (nGPT) ☆145 · Updated last month
- Muon optimizer for neural networks: >30% extra sample efficiency, <3% wallclock overhead ☆210 · Updated last week
- Annotated version of the Mamba paper ☆469 · Updated 10 months ago
- Official Implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters ☆477 · Updated this week
- ☆180 · Updated this week
- Some preliminary explorations of Mamba's context scaling. ☆206 · Updated 11 months ago
- ☆240 · Updated 4 months ago
- ☆152 · Updated last month
- Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in PyTorch ☆492 · Updated 2 months ago
- Simplified Masked Diffusion Language Model ☆251 · Updated last month
- When it comes to optimizers, it's always better to be safe than sorry ☆157 · Updated this week
- A MAD laboratory to improve AI architecture designs 🧪 ☆102 · Updated last month
- ☆146 · Updated last month
- [NeurIPS 2024] Official Repository of The Mamba in the Llama: Distilling and Accelerating Hybrid Models ☆188 · Updated 2 weeks ago
- Implementation of Diffusion Transformer (DiT) in JAX ☆261 · Updated 7 months ago
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, spars… (see the sketch after this list) ☆277 · Updated last month
- Efficient optimizers ☆144 · Updated this week
- Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff" ☆219 · Updated last month
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆505 · Updated 2 months ago
- The AdEMAMix Optimizer: Better, Faster, Older. ☆178 · Updated 4 months ago
- PyTorch implementation of the PEER block from the paper Mixture of A Million Experts, by Xu Owen He at DeepMind ☆115 · Updated 4 months ago
- Code for Adam-mini: Use Fewer Learning Rates To Gain More (https://arxiv.org/abs/2406.16793) ☆376 · Updated last month
- [ICML 2024 Best Paper] Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution (https://arxiv.org/abs/2310.16834) ☆454 · Updated 10 months ago
- Helpful tools and examples for working with flex-attention ☆583 · Updated this week
- Supporting PyTorch FSDP for optimizers ☆75 · Updated last month
- DeMo: Decoupled Momentum Optimization ☆170 · Updated last month
- ☆304 · Updated 2 weeks ago
- The simplest, fastest repository for training/finetuning medium-sized GPTs. ☆90 · Updated last month
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" ☆203 · Updated 3 weeks ago
- Implementation of the Llama architecture with RLHF + Q-learning ☆157 · Updated last year
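The memory-layers entry above describes a trainable key-value lookup that adds parameters without adding FLOPs. Here is a hedged sketch of that idea, assuming a simple dense top-k lookup; the actual work uses product-key decomposition to avoid scoring every key, and `MemoryLayer`, `num_memories`, and `topk` are illustrative names rather than the repository's API.

```python
import torch
import torch.nn.functional as F
from torch import nn

class MemoryLayer(nn.Module):
    # A large learnable table of keys and values adds parameters, but each
    # token only reads its top-k nearest keys, so per-token compute scales
    # with `topk`, not with the table size.
    def __init__(self, dim, num_memories=4096, topk=4):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_memories, dim))
        self.values = nn.Parameter(torch.randn(num_memories, dim))
        self.topk = topk

    def forward(self, x):                      # x: (batch, seq, dim)
        scores = x @ self.keys.t()             # similarity to every key
        topv, topi = scores.topk(self.topk, dim=-1)
        weights = topv.softmax(dim=-1)         # normalize over selected keys
        selected = self.values[topi]           # (batch, seq, topk, dim)
        return (weights.unsqueeze(-1) * selected).sum(dim=-2)

# usage
layer = MemoryLayer(dim=64)
out = layer(torch.randn(2, 16, 64))            # (2, 16, 64)
```

The parameter count grows with `num_memories` while the per-token gather touches only `topk` value vectors, which is the sense in which extra capacity comes at little FLOP cost.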