kyleliang919 / C-Optim
When it comes to optimizers, it's always better to be safe than sorry
☆233 Updated 2 months ago
Alternatives and similar repositories for C-Optim
Users who are interested in C-Optim are comparing it to the libraries listed below.
- ☆286 Updated last month
- Attempt to make multiple residual streams from Bytedance's Hyper-Connections paper accessible to the public ☆83 Updated 3 months ago
- Code for Adam-mini: Use Fewer Learning Rates To Gain More https://arxiv.org/abs/2406.16793 ☆417 Updated 3 weeks ago
- Official implementation of the paper: "ZClip: Adaptive Spike Mitigation for LLM Pre-Training". ☆124 Updated last week
- Quick implementation of nGPT, learning entirely on the hypersphere, from NvidiaAI ☆282 Updated 2 months ago
- Normalized Transformer (nGPT) ☆181 Updated 6 months ago
- Muon optimizer: +>30% sample efficiency with <3% wallclock overhead ☆661 Updated last week
- Implementation of Infini-Transformer in Pytorch ☆111 Updated 5 months ago
- ☆190 Updated this week
- [ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule ☆167 Updated 2 months ago
- CIFAR-10 speedruns: 94% in 2.6 seconds and 96% in 27 seconds ☆237 Updated 3 months ago
- Pytorch implementation of the PEER block from the paper, Mixture of A Million Experts, by Xu Owen He at Deepmind ☆124 Updated 9 months ago
- Efficient optimizers ☆206 Updated this week
- Implementation of the proposed Adam-atan2 from Google Deepmind in Pytorch ☆104 Updated 6 months ago
- Scalable and Performant Data Loading ☆269 Updated this week
- Helpful tools and examples for working with flex-attention ☆811 Updated this week
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models ☆294 Updated 3 months ago
- Official PyTorch Implementation for Paper "No More Adam: Learning Rate Scaling at Initialization is All You Need" ☆51 Updated 4 months ago
- The AdEMAMix Optimizer: Better, Faster, Older. ☆183 Updated 8 months ago
- Implementation of TiTok, proposed by Bytedance in "An Image is Worth 32 Tokens for Reconstruction and Generation" ☆173 Updated 11 months ago
- Implementation of a multimodal diffusion transformer in Pytorch ☆102 Updated 11 months ago
- Just some miscellaneous utility functions / decorators / modules related to Pytorch and Accelerate to help speed up implementation of new… ☆120 Updated 10 months ago
- Code and weights for the paper "Cluster and Predict Latents Patches for Improved Masked Image Modeling" ☆106 Updated last month
- Explorations into the recently proposed Taylor Series Linear Attention ☆98 Updated 9 months ago
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" ☆237 Updated 4 months ago
- An implementation of PSGD Kron second-order optimizer for PyTorch ☆91 Updated 2 months ago
- Implementation of Soft MoE, proposed by Brain's Vision team, in Pytorch ☆294 Updated 2 months ago
- [ICLR2025 Spotlight🔥] Official Implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters ☆559 Updated 3 months ago
- [ICML 2025] Fourier Position Embedding: Enhancing Attention’s Periodic Extension for Length Generalization ☆69 Updated 4 months ago
- [CVPR 2025 Highlight] The official CLIP training codebase of Inf-CL: "Breaking the Memory Barrier: Near Infinite Batch Size Scaling for C… ☆248 Updated 4 months ago