arogozhnikov / adamw_bfloat16
AdamW optimizer for bfloat16 models in PyTorch 🔥.
☆32 · Updated 9 months ago
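For context, the sketch below shows the kind of training setup this optimizer targets: a model whose parameters are kept directly in bfloat16. It uses the stock `torch.optim.AdamW` purely as a stand-in, since the exact `adamw_bfloat16` import and API are not shown on this page; the model, data, and hyperparameters are illustrative.

```python
# Illustrative sketch only: torch.optim.AdamW stands in for the bf16-specialized
# optimizer; model, data, and hyperparameters are made up for the example.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Model whose parameters are stored directly in bfloat16.
model = nn.Sequential(nn.Linear(128, 256), nn.GELU(), nn.Linear(256, 10))
model = model.to(dtype=torch.bfloat16)

# Stock AdamW keeps its running moments in the parameter dtype (bf16 here),
# which is the precision pitfall a bf16-aware AdamW variant is meant to address.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

x = torch.randn(32, 128, dtype=torch.bfloat16)
target = torch.randint(0, 10, (32,))

for _ in range(10):
    optimizer.zero_grad(set_to_none=True)
    logits = model(x)
    # Compute the loss in fp32 for numerical stability of the reduction.
    loss = F.cross_entropy(logits.float(), target)
    loss.backward()
    optimizer.step()
```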
Alternatives and similar repositories for adamw_bfloat16:
Users interested in adamw_bfloat16 are comparing it to the libraries listed below
- RWKV model implementation ☆37 · Updated last year
- ☆33 · Updated last year
- Unofficially Implements https://arxiv.org/abs/2112.05682 to get Linear Memory Cost on Attention for PyTorch ☆12 · Updated 3 years ago
- A place to store reusable transformer components of my own creation or found on the interwebs ☆48 · Updated last week
- ☆31 · Updated last month
- [NeurIPS 2023] Sparse Modular Activation for Efficient Sequence Modeling ☆36 · Updated last year
- Engineering the state of RNN language models (Mamba, RWKV, etc.) ☆32 · Updated 9 months ago
- Triton Implementation of HyperAttention Algorithm ☆47 · Updated last year
- Implementation of Token Shift GPT - An autoregressive model that solely relies on shifting the sequence space for mixing ☆48 · Updated 3 years ago
- My explorations into editing the knowledge and memories of an attention network ☆34 · Updated 2 years ago
- Utilities for PyTorch distributed ☆23 · Updated 3 weeks ago
- AdaCat ☆49 · Updated 2 years ago
- CUDA implementation of autoregressive linear attention, with all the latest research findings ☆44 · Updated last year
- Code for the note "NF4 Isn't Information Theoretically Optimal (and that's Good)" ☆18 · Updated last year
- DiCE: The Infinitely Differentiable Monte-Carlo Estimator ☆31 · Updated last year
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8 ☆45 · Updated 8 months ago
- Implementation of a Transformer using ReLA (Rectified Linear Attention) from https://arxiv.org/abs/2104.07012 ☆49 · Updated 2 years ago
- Standalone pre-training recipe with JAX+Flax ☆31 · Updated last year
- Source-to-Source Debuggable Derivatives in Pure Python ☆15 · Updated last year
- Serialize JAX, Flax, Haiku, or Objax model params with 🤗 `safetensors` ☆44 · Updated 9 months ago
- ☆21 · Updated 2 years ago
- ☆95 · Updated 9 months ago
- Automatically take good care of your preemptible TPUs ☆36 · Updated last year
- LayerNorm(SmallInit(Embedding)) in a Transformer to improve convergence ☆60 · Updated 3 years ago
- ☆52 · Updated 5 months ago
- Blog post ☆17 · Updated last year
- ☆29 · Updated 2 years ago
- A collection of Models, Datasets, DataModules, Callbacks, Metrics, Losses and Loggers to better integrate pytorch-lightning with transformers ☆47 · Updated last year
- Tensor Parallelism with JAX + Shard Map ☆11 · Updated last year
- Unofficial but Efficient Implementation of "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" in JAX ☆83 · Updated last year