zichongli5 / NorMuon
Official Implementation for NorMuon paper
☆39 · Updated 3 weeks ago
Alternatives and similar repositories for NorMuon
Users interested in NorMuon are comparing it to the repositories listed below.
- ☆13 · Updated 10 months ago
- Tiny re-implementation of MDM in style of LLaDA and nano-gpt speedrun ☆57 · Updated 8 months ago
- ☆88 · Updated last year
- ☆47 · Updated last month
- Code for the paper "Function-Space Learning Rates" ☆23 · Updated 5 months ago
- Tiled Flash Linear Attention library for fast and efficient mLSTM Kernels. ☆74 · Updated 2 weeks ago
- Efficient PScan implementation in PyTorch ☆17 · Updated last year
- ☆34 · Updated last year
- Code for "Theoretical Foundations of Deep Selective State-Space Models" (NeurIPS 2024) ☆15 · Updated 10 months ago
- Unofficial Implementation of Selective Attention Transformer ☆17 · Updated last year
- RWKV-X is a Linear Complexity Hybrid Language Model based on the RWKV architecture, integrating Sparse Attention to improve the model's l… ☆51 · Updated 4 months ago
- [NeurIPS 2023 spotlight] Official implementation of HGRN in our NeurIPS 2023 paper - Hierarchically Gated Recurrent Neural Network for Se… ☆66 · Updated last year
- Triton Implementation of HyperAttention Algorithm ☆48 · Updated last year
- ☆82 · Updated last year
- Supporting code for the blog post on modular manifolds. ☆103 · Updated 2 months ago
- GoldFinch and other hybrid transformer components ☆45 · Updated last year
- Yet another random morning idea to be quickly tried and architecture shared if it works; to allow the transformer to pause for any amount… ☆53 · Updated 2 years ago
- Implementation of GateLoop Transformer in Pytorch and Jax ☆91 · Updated last year
- ☆61 · Updated last year
- JAX Scalify: end-to-end scaled arithmetics ☆17 · Updated last year
- Code for the paper "Cottention: Linear Transformers With Cosine Attention" ☆20 · Updated 2 weeks ago
- Implementation of Infini-Transformer in Pytorch ☆113 · Updated 10 months ago
- ☆36 · Updated 8 months ago
- ☆32 · Updated last year
- Fork of Flame repo for training of some new stuff in development ☆19 · Updated last week
- One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation ☆45 · Updated last month
- ☆68 · Updated last year
- ☆32 · Updated last year
- Landing repository for the paper "Softpick: No Attention Sink, No Massive Activations with Rectified Softmax" ☆85 · Updated 2 months ago
- Implementation of Gradient Agreement Filtering, from Chaubard et al. of Stanford, but for single machine microbatches, in Pytorch ☆25 · Updated 10 months ago