lucidrains / nGPT-pytorch
Quick implementation of nGPT, learning entirely on the hypersphere, from NvidiaAI
☆293 · Jun 3, 2025 · Updated 8 months ago
Alternatives and similar repositories for nGPT-pytorch
Users who are interested in nGPT-pytorch are comparing it to the libraries listed below.
- Normalized Transformer (nGPT) ☆198 · Nov 19, 2024 · Updated last year
- Implementation of the proposed Spline-Based Transformer from Disney Research ☆105 · Nov 9, 2024 · Updated last year
- ☆55 · Nov 22, 2024 · Updated last year
- Implementation of the proposed Adam-atan2 from Google Deepmind in Pytorch ☆135 · Oct 15, 2025 · Updated 4 months ago
- Minimal implementation of TokenFormer for inference and learning ☆13 · Nov 6, 2024 · Updated last year
- Attempt to make multiple residual streams from Bytedance's Hyper-Connections paper accessible to the public ☆172 · Feb 4, 2026 · Updated last week
- [ICLR2025 Spotlight🔥] Official Implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters ☆588 · Feb 11, 2025 · Updated last year
- Implementation of the proposed minGRU in Pytorch ☆319 · Dec 10, 2025 · Updated 2 months ago
- Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch ☆549 · May 16, 2025 · Updated 9 months ago
- Implementation of the proposed MaskBit from Bytedance AI ☆83 · Nov 12, 2024 · Updated last year
- Unofficial implementation of Titans, SOTA memory for transformers, in Pytorch ☆1,935 · Feb 9, 2026 · Updated last week
- Associative scan package for DRYing some code between repos ☆18 · Jan 5, 2026 · Updated last month
- Explorations into the proposal from the paper "Grokfast, Accelerated Grokking by Amplifying Slow Gradients" ☆103 · Dec 22, 2024 · Updated last year
- [ICLR 2025 & COLM 2025] Official PyTorch implementation of the Forgetting Transformer and Adaptive Computation Pruning ☆137 · Dec 19, 2025 · Updated last month
- Helpful tools and examples for working with flex-attention ☆1,127 · Feb 8, 2026 · Updated last week
- My attempts at applying Soundstream design on learned tokenization of text and then applying hierarchical attention to text generation ☆90 · Oct 11, 2024 · Updated last year
- Pretraining and inference code for a large-scale depth-recurrent language model ☆859 · Dec 29, 2025 · Updated last month
- Implementation of a single layer of the MMDiT, proposed in Stable Diffusion 3, in Pytorch ☆514 · Jan 18, 2026 · Updated 3 weeks ago
- Stick-breaking attention ☆62 · Jul 1, 2025 · Updated 7 months ago
- NanoGPT (124M) in 2 minutes ☆4,624 · Updated this week
- Triton-based implementation of Sparse Mixture of Experts. ☆265 · Oct 3, 2025 · Updated 4 months ago
- Implementation of CALM from the paper "LLM Augmented LLMs: Expanding Capabilities through Composition", out of Google Deepmind ☆179 · Sep 12, 2024 · Updated last year
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆132 · Dec 3, 2024 · Updated last year
- Implementation of the paper "Variable Bitrate Residual Vector Quantization for Audio Coding" ☆11 · Apr 10, 2025 · Updated 10 months ago
- Just a repository that will house some MLPs and their variants, so to avoid having to reimplement them again and again for different proj… ☆45 · Jan 29, 2026 · Updated 2 weeks ago
- Implementation of a holodeck, written in Pytorch ☆18 · Nov 1, 2023 · Updated 2 years ago
- The open-source materials for paper "Sparsing Law: Towards Large Language Models with Greater Activation Sparsity". ☆30 · Nov 12, 2024 · Updated last year
- Implementation of 2-simplicial attention proposed by Clift et al. (2019) and the recent attempt to make practical in Fast and Simplex, Ro… ☆46 · Sep 2, 2025 · Updated 5 months ago
- Muon is Scalable for LLM Training ☆1,432 · Aug 3, 2025 · Updated 6 months ago
- Muon is an optimizer for hidden layers in neural networks ☆2,290 · Jan 19, 2026 · Updated 3 weeks ago
- Implementation of a Light Recurrent Unit in Pytorch ☆49 · Oct 6, 2024 · Updated last year
- ☆41 · May 15, 2023 · Updated 2 years ago
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, spars… ☆371 · Dec 12, 2024 · Updated last year
- Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff" ☆248 · Jun 6, 2025 · Updated 8 months ago
- RWKV6 in native pytorch and triton:) ☆11 · Aug 4, 2024 · Updated last year
- Axial Positional Embedding for Pytorch ☆84 · Feb 25, 2025 · Updated 11 months ago
- Minimal implementation of scalable rectified flow transformers, based on SD3's approach ☆632 · Jul 1, 2024 · Updated last year
- [ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule ☆453 · Sep 15, 2025 · Updated 5 months ago
- ☆579 · Sep 23, 2025 · Updated 4 months ago