cloneofsimo / minSAE
☆29 · Updated 7 months ago
Alternatives and similar repositories for minSAE
Users who are interested in minSAE are comparing it to the libraries listed below.
- Supporting PyTorch FSDP for optimizers ☆82 · Updated 7 months ago
- WIP ☆93 · Updated 11 months ago
- Simple implementation of muP, based on Spectral Condition for Feature Learning. The implementation is SGD only; don't use it for Adam. ☆82 · Updated 11 months ago
- Tiny re-implementation of MDM in style of LLaDA and nano-gpt speedrun ☆54 · Updated 4 months ago
- Focused on fast experimentation and simplicity ☆76 · Updated 6 months ago
- Minimal (400 LOC) implementation, maximum (multi-node, FSDP) GPT training ☆129 · Updated last year
- The simplest, fastest repository for training/finetuning medium-sized GPTs. ☆147 · Updated 2 weeks ago
- ☆19 · Updated last month
- ☆79 · Updated last year
- ☆61 · Updated 8 months ago
- research impl of Native Sparse Attention (2502.11089) ☆54 · Updated 4 months ago
- ☆197 · Updated 7 months ago
- Efficient optimizers ☆232 · Updated last week
- ☆37 · Updated 3 months ago
- Minimal (truly) muP implementation, consistent with TP4 and TP5 papers notation ☆14 · Updated last month
- DeMo: Decoupled Momentum Optimization ☆189 · Updated 7 months ago
- ☆110 · Updated last month
- An implementation of the Llama architecture, to instruct and delight ☆21 · Updated last month
- Mixture of A Million Experts ☆46 · Updated 11 months ago
- PyTorch implementation of the PEER block from the paper "Mixture of A Million Experts" by Xu Owen He at DeepMind ☆127 · Updated 10 months ago
- ☆34 · Updated 10 months ago
- Accelerated First Order Parallel Associative Scan ☆182 · Updated 10 months ago
- Code accompanying the paper "Generalized Interpolating Discrete Diffusion" ☆91 · Updated last month
- ☆26 · Updated 2 weeks ago
- ☆52 · Updated last year
- Maximal Update Parametrization (μP) with Flax & Optax. ☆11 · Updated last year
- JAX Implementation of Liger Kernels ☆9 · Updated 8 months ago
- ☆24 · Updated 2 months ago
- ☆53 · Updated last year
- ☆53 · Updated last year