kyleliang919 / C-Optim
When it comes to optimizers, it's always better to be safe than sorry
☆143 · Updated last week
Alternatives and similar repositories for C-Optim:
Users who are interested in C-Optim compare it to the libraries listed below.
- Implementation of Infini-Transformer in PyTorch ☆106 · Updated 2 months ago
- PyTorch implementation of the PEER block from the paper, Mixture of A Million Experts, by Xu Owen He at DeepMind ☆115 · Updated 3 months ago
- APOLLO: SGD-like Memory, AdamW-level Performance ☆66 · Updated this week
- [ICML'24 Oral] The official code of "DiJiang: Efficient Large Language Models through Compact Kernelization", a novel DCT-based linear at… ☆99 · Updated 6 months ago
- Low-bit optimizers for PyTorch ☆121 · Updated last year
- ☆235 · Updated 3 months ago
- Implementation of the proposed Adam-atan2 from Google DeepMind in PyTorch ☆99 · Updated 3 weeks ago
- Implementation of the proposed MaskBit from Bytedance AI ☆66 · Updated last month
- Implementation of TiTok, proposed by Bytedance in "An Image is Worth 32 Tokens for Reconstruction and Generation" ☆162 · Updated 5 months ago
- [NeurIPS 2024] Official Repository of The Mamba in the Llama: Distilling and Accelerating Hybrid Models ☆181 · Updated this week
- Implementation of a multimodal diffusion transformer in PyTorch ☆98 · Updated 5 months ago
- Muon optimizer for neural networks: >30% extra sample efficiency, <3% wallclock overhead ☆180 · Updated this week
- Just some miscellaneous utility functions / decorators / modules related to PyTorch and Accelerate to help speed up implementation of new… ☆119 · Updated 4 months ago
- Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention" ☆94 · Updated 2 months ago
- Code for Adam-mini: Use Fewer Learning Rates To Gain More https://arxiv.org/abs/2406.16793 ☆360 · Updated last week
- Implementation of Soft MoE, proposed by Brain's Vision team, in PyTorch ☆248 · Updated 7 months ago
- ☆170 · Updated 2 months ago
- Official implementation of "DoRA: Weight-Decomposed Low-Rank Adaptation" ☆123 · Updated 7 months ago
- ☆98 · Updated 9 months ago
- Some preliminary explorations of Mamba's context scaling. ☆200 · Updated 10 months ago
- Normalized Transformer (nGPT) ☆136 · Updated 3 weeks ago
- Implementation of Agent Attention in PyTorch ☆87 · Updated 5 months ago
- Patch convolution to avoid large GPU memory usage of Conv2D ☆79 · Updated 6 months ago
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (Official Code) ☆146 · Updated 2 months ago
- Griffin MQA + Hawk Linear RNN Hybrid ☆85 · Updated 7 months ago
- ☆74 · Updated 5 months ago
- Implementation of a single layer of the MMDiT, proposed in Stable Diffusion 3, in PyTorch ☆268 · Updated 3 months ago
- Implementation of 🥥 Coconut, Chain of Continuous Thought, in PyTorch ☆60 · Updated this week
- DeMo: Decoupled Momentum Optimization ☆147 · Updated 2 weeks ago
- Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule ☆45 · Updated this week