arogozhnikov / adamw_bfloat16
AdamW optimizer for bfloat16 models in PyTorch 🔥.
☆32 · Updated 11 months ago
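To illustrate the problem space this repo targets, here is a minimal sketch using stock `torch.optim.AdamW` on a model cast to bfloat16. This deliberately does not use this repo's own API (which is not shown here); it only demonstrates the baseline behavior a bfloat16-aware AdamW variant would improve on.

```python
import torch

# Tiny model cast to bfloat16. With stock AdamW, the optimizer state
# (exp_avg, exp_avg_sq) is allocated in the parameter dtype, so it is
# also bfloat16 here; the resulting precision loss in the moment
# estimates is the usual motivation for a bfloat16-aware AdamW.
model = torch.nn.Linear(4, 2).to(torch.bfloat16)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

x = torch.randn(8, 4, dtype=torch.bfloat16)
loss = model(x).square().mean()
loss.backward()
opt.step()
opt.zero_grad()
```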
Alternatives and similar repositories for adamw_bfloat16
Users interested in adamw_bfloat16 are comparing it to the libraries listed below.
- [NeurIPS 2023] Sparse Modular Activation for Efficient Sequence Modeling · ☆36 · Updated last year
- Implementation of Token Shift GPT, an autoregressive model that relies solely on shifting the sequence space for mixing · ☆50 · Updated 3 years ago
- ☆32 · Updated last year
- Engineering the state of RNN language models (Mamba, RWKV, etc.) · ☆32 · Updated 11 months ago
- My explorations into editing the knowledge and memories of an attention network · ☆34 · Updated 2 years ago
- Triton implementation of the HyperAttention algorithm · ☆48 · Updated last year
- RWKV model implementation · ☆37 · Updated last year
- Source-to-Source Debuggable Derivatives in Pure Python · ☆15 · Updated last year
- Serialize JAX, Flax, Haiku, or Objax model params with 🤗 `safetensors` · ☆44 · Updated 11 months ago
- Unofficially implements https://arxiv.org/abs/2112.05682 to get linear memory cost on attention for PyTorch · ☆12 · Updated 3 years ago
- ☆21 · Updated 2 years ago
- Experiment of using Tangent to autodiff Triton · ☆78 · Updated last year
- ☆29 · Updated 2 years ago
- Code release for the "Broken Neural Scaling Laws" (BNSL) paper · ☆58 · Updated last year
- ☆31 · Updated last month
- Parallel Associative Scan for Language Models · ☆18 · Updated last year
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8 · ☆45 · Updated 10 months ago
- An attempt to merge ESBN with Transformers, to endow Transformers with the ability to emergently bind symbols · ☆15 · Updated 3 years ago
- DiCE: The Infinitely Differentiable Monte-Carlo Estimator · ☆31 · Updated last year
- Latent Diffusion Language Models · ☆68 · Updated last year
- Code for the note "NF4 Isn't Information Theoretically Optimal (and that's Good)" · ☆18 · Updated last year
- The official Languini Kitchen repository · ☆14 · Updated last year
- Using FlexAttention to compute attention with different masking patterns · ☆43 · Updated 7 months ago
- Blog post · ☆17 · Updated last year
- Standalone pre-training recipe with JAX+Flax · ☆31 · Updated 2 years ago
- A place to store reusable transformer components of my own creation or found on the interwebs · ☆55 · Updated this week
- Implements the SM3-II adaptive optimization algorithm for PyTorch · ☆33 · Updated 8 months ago
- Unofficial but efficient implementation of "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" in JAX · ☆83 · Updated last year
- Transformer with Mu-Parameterization, implemented in Jax/Flax; supports FSDP on TPU pods · ☆30 · Updated this week
- Some common Hugging Face transformers in maximal update parametrization (µP) · ☆80 · Updated 3 years ago