ironjr / grokfast
Official repository for the paper "Grokfast: Accelerated Grokking by Amplifying Slow Gradients"
☆488 · Updated 2 months ago
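Grokfast's core recipe, per the paper title: treat each parameter's gradient sequence over training steps as a signal, low-pass filter it with an exponential moving average, and amplify that slow component before the optimizer step. Below is a minimal sketch of the EMA variant, assuming the hyperparameter names `alpha` (filter momentum) and `lamb` (amplification factor) used in this repo's README; verify against the source.

```python
# Minimal sketch of Grokfast's EMA gradient filter (gradfilter_ema),
# reconstructed from the paper's description; check the repo for the
# exact reference implementation.
from typing import Dict, Optional

import torch
import torch.nn as nn


def gradfilter_ema(
    m: nn.Module,
    grads: Optional[Dict[str, torch.Tensor]] = None,
    alpha: float = 0.98,  # EMA decay: higher keeps a slower (lower-frequency) component
    lamb: float = 2.0,    # how strongly the slow component is amplified
) -> Dict[str, torch.Tensor]:
    # First call: seed the EMA state with the current gradients.
    if grads is None:
        grads = {
            n: p.grad.data.detach()
            for n, p in m.named_parameters()
            if p.requires_grad and p.grad is not None
        }
    for n, p in m.named_parameters():
        if p.requires_grad and p.grad is not None:
            # Low-pass filter the gradient signal with an EMA.
            grads[n] = grads[n] * alpha + p.grad.data.detach() * (1 - alpha)
            # Add the amplified slow component back onto the raw gradient.
            p.grad.data = p.grad.data + grads[n] * lamb
    return grads
```

Intended use is one extra line in the training loop, `grads = gradfilter_ema(model, grads=grads)`, placed between `loss.backward()` and `optimizer.step()`, carrying the `grads` state dict across iterations.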
Related projects:
- Annotated version of the Mamba paper ☆445 · Updated 6 months ago
- GPT-2 (124M) quality in 5B tokens ☆227 · Updated last week
- Official implementation of "Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling" ☆778 · Updated last month
- Implementation of Diffusion Transformer (DiT) in JAX ☆246 · Updated 3 months ago
- Open weights language model from Google DeepMind, based on Griffin. ☆595 · Updated 2 months ago
- UNet diffusion model in pure CUDA ☆562 · Updated 2 months ago
- Training small GPT-2 style models using Kolmogorov-Arnold networks. ☆105 · Updated 3 months ago
- Code repository for Black Mamba ☆218 · Updated 7 months ago
- Mamba-Chat: A chat LLM based on the state-space model architecture 🐍 ☆897 · Updated 6 months ago
- Schedule-Free Optimization in PyTorch ☆1,809 · Updated last month
- A repository for log-time feedforward networks ☆215 · Updated 5 months ago
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆452 · Updated last week
- The AdEMAMix Optimizer: Better, Faster, Older. ☆132 · Updated last week
- System 2 Reasoning Link Collection ☆605 · Updated this week
- Repo for "Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture" ☆530 · Updated 4 months ago
- Simple, minimal implementation of the Mamba SSM in one PyTorch file. More efficient than using for loops, but probably less efficient tha… ☆89 · Updated 5 months ago
- Code repository for the UltraFastBERT paper ☆508 · Updated 5 months ago
- PyTorch implementation of Infini-Transformer from "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention" ☆271 · Updated 4 months ago
- Code for Adam-mini: Use Fewer Learning Rates To Gain More (https://arxiv.org/abs/2406.16793) ☆285 · Updated this week
- Build high-performance AI models with modular building blocks ☆377 · Updated this week
- Kolmogorov-Arnold Networks (KAN) using Chebyshev polynomials instead of B-splines. ☆336 · Updated 4 months ago
- Library for Jacobian descent with PyTorch. It enables optimization of neural networks with multiple losses (e.g. multi-task learning). ☆126 · Updated this week
- The Tensor (or Array) ☆388 · Updated last month
- Fast bare-bones BPE for modern tokenizer training ☆138 · Updated 3 weeks ago
- A Jax-based library for designing and training transformer models from scratch. ☆272 · Updated 3 weeks ago
- Sparse autoencoders ☆297 · Updated last week
- Implementation of MambaByte from "MambaByte: Token-free Selective State Space Model" in PyTorch and Zeta ☆103 · Updated last week
- Stop messing around with finicky sampling parameters and just use DRµGS! ☆313 · Updated 3 months ago