arogozhnikov / adamw_bfloat16
AdamW optimizer for bfloat16 models in PyTorch 🔥.
⭐33 · Updated last year
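For context on the problem this repo targets: bfloat16 keeps only about 8 mantissa bits, so when AdamW's update lr * grad falls far below a weight's rounding unit, a plain optimizer step rounds to no change at all. A minimal sketch of the setup using stock torch.optim.AdamW (the repo provides its own bfloat16-aware drop-in variant; its exact import path is not shown here):

```python
import torch

# Hypothetical toy model held entirely in bfloat16 weights.
model = torch.nn.Linear(512, 512).to(dtype=torch.bfloat16)

# Stock AdamW runs on bf16 parameters, but tiny updates can be lost to
# rounding; a bf16-aware AdamW (what this repo provides) compensates.
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 512, dtype=torch.bfloat16)
loss = model(x).float().pow(2).mean()  # reduce in fp32 for a stable loss
loss.backward()
opt.step()
opt.zero_grad()
```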
Alternatives and similar repositories for adamw_bfloat16
Users interested in adamw_bfloat16 are comparing it to the libraries listed below.
- ⭐32 · Updated last year
- Implementation of Token Shift GPT, an autoregressive model that relies solely on shifting the sequence space for mixing (see the sketch after this list) · ⭐50 · Updated 3 years ago
- A place to store reusable transformer components of my own creation or found on the interwebs · ⭐56 · Updated this week
- Engineering the state of RNN language models (Mamba, RWKV, etc.) · ⭐32 · Updated last year
- My explorations into editing the knowledge and memories of an attention network · ⭐35 · Updated 2 years ago
- ⭐29 · Updated 2 years ago
- Implementation of a Transformer using ReLA (Rectified Linear Attention) from https://arxiv.org/abs/2104.07012 (see the sketch after this list) · ⭐49 · Updated 3 years ago
- CUDA implementation of autoregressive linear attention, with all the latest research findings · ⭐44 · Updated 2 years ago
- AdaCat · ⭐49 · Updated 2 years ago
- ⭐31 · Updated last month
- [NeurIPS 2023] Sparse Modular Activation for Efficient Sequence Modeling · ⭐37 · Updated last year
- Source-to-Source Debuggable Derivatives in Pure Python · ⭐15 · Updated last year
- RWKV model implementation · ⭐38 · Updated 2 years ago
- A case study of efficient training of large language models using commodity hardware · ⭐68 · Updated 2 years ago
- Unofficial implementation of https://arxiv.org/abs/2112.05682 for linear-memory-cost attention in PyTorch (see the sketch after this list) · ⭐12 · Updated 3 years ago
- Latent Diffusion Language Models · ⭐68 · Updated last year
- High-performance PyTorch modules · ⭐18 · Updated 2 years ago
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8 · ⭐46 · Updated last year
- Tensor Parallelism with JAX + Shard Map · ⭐11 · Updated last year
- Code for the note "NF4 Isn't Information Theoretically Optimal (and that's Good)" · ⭐19 · Updated 2 years ago
- [Oral; NeurIPS OPT 2024] μLO: Compute-Efficient Meta-Generalization of Learned Optimizers · ⭐13 · Updated 4 months ago
- LayerNorm(SmallInit(Embedding)) in a Transformer to improve convergence (see the sketch after this list) · ⭐59 · Updated 3 years ago
- Utilities for PyTorch distributed · ⭐24 · Updated 4 months ago
- An attempt to merge ESBN with Transformers, to endow Transformers with the ability to emergently bind symbols · ⭐16 · Updated 3 years ago
- A collection of Models, Datasets, DataModules, Callbacks, Metrics, Losses and Loggers to better integrate pytorch-lightning with transfor… · ⭐47 · Updated 2 years ago
- Triton Implementation of HyperAttention Algorithm · ⭐48 · Updated last year
- Training and evaluation code for the paper "Headless Language Models: Learning without Predicting with Contrastive Weight Tying" (https:/… · ⭐27 · Updated last year
- MaskedTensors for PyTorch · ⭐38 · Updated 3 years ago
- Standalone Product Key Memory module in PyTorch, for augmenting Transformer models · ⭐82 · Updated 11 months ago
- Serialize JAX, Flax, Haiku, or Objax model params with 🤗 `safetensors` (see the sketch after this list) · ⭐45 · Updated last year
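A few of the entries above name mechanisms compact enough to sketch. First, the Token Shift GPT entry mixes along the sequence purely by shifting channels; a minimal sketch of the core step, assuming the simplest variant that shifts half the channels back one position (the actual repo shifts several feature segments by increasing amounts):

```python
import torch

def token_shift(x: torch.Tensor) -> torch.Tensor:
    """x: (batch, seq_len, dim). Shift half the channels one step back in
    time, so position t mixes in features from position t - 1."""
    keep, shift = x.chunk(2, dim=-1)
    # Prepend a zero step and drop the last one: a causal shift by 1.
    shift = torch.cat((torch.zeros_like(shift[:, :1]), shift[:, :-1]), dim=1)
    return torch.cat((keep, shift), dim=-1)

x = torch.randn(2, 16, 64)
print(token_shift(x).shape)  # torch.Size([2, 16, 64])
```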
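The ReLA entry (https://arxiv.org/abs/2104.07012) swaps attention's softmax for a ReLU, giving sparse, unnormalized scores; the paper then normalizes the attention output. A sketch, where the RMS-style output scaling stands in for the paper's normalization variants and is my assumption:

```python
import torch
import torch.nn.functional as F

def rela_attention(q, k, v, eps=1e-6):
    """Rectified Linear Attention: ReLU over scores instead of softmax.
    q, k, v: (batch, heads, seq, head_dim)."""
    scale = q.shape[-1] ** -0.5
    scores = F.relu(q @ k.transpose(-2, -1) * scale)  # sparse, does not sum to 1
    out = scores @ v
    # Some output normalization is needed because ReLU scores are
    # unnormalized; RMS-style scaling here is an assumption, not
    # necessarily the paper's exact choice.
    return out * torch.rsqrt(out.pow(2).mean(dim=-1, keepdim=True) + eps)
```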
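The entry implementing https://arxiv.org/abs/2112.05682 avoids materializing the full seq × seq score matrix: it streams over key/value chunks while carrying a running max, numerator, and denominator for the softmax. A simplified single-head, non-causal sketch (the paper also chunks queries, omitted here):

```python
import torch

def chunked_attention(q, k, v, chunk=128):
    """Attention with O(seq * chunk) score memory instead of O(seq^2).
    q, k, v: (batch, seq, dim)."""
    scale = q.shape[-1] ** -0.5
    num = torch.zeros_like(q)                          # running numerator
    den = q.new_zeros(*q.shape[:-1], 1)                # running denominator
    m = q.new_full((*q.shape[:-1], 1), float("-inf"))  # running max
    for i in range(0, k.shape[1], chunk):
        kc, vc = k[:, i:i + chunk], v[:, i:i + chunk]
        s = q @ kc.transpose(-2, -1) * scale           # (batch, seq, chunk)
        m_new = torch.maximum(m, s.amax(dim=-1, keepdim=True))
        p = torch.exp(s - m_new)
        corr = torch.exp(m - m_new)                    # rescale old accumulators
        num = num * corr + p @ vc
        den = den * corr + p.sum(dim=-1, keepdim=True)
        m = m_new
    return num / den

q = k = v = torch.randn(2, 1024, 64)
ref = torch.softmax(q @ k.transpose(-2, -1) * 64 ** -0.5, dim=-1) @ v
print(torch.allclose(chunked_attention(q, k, v), ref, atol=1e-4))  # True
```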
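The LayerNorm(SmallInit(Embedding)) entry is itself a one-line recipe: initialize the token embedding near zero and follow it immediately with a LayerNorm, so the first block sees unit-scale activations from the start of training. A sketch; the uniform(-1e-4, 1e-4) init is my assumption for "small":

```python
import torch
import torch.nn as nn

class SmallInitEmbedding(nn.Module):
    """Near-zero embedding init followed by LayerNorm, to improve early
    convergence; the 1e-4 scale is an assumed value."""
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        nn.init.uniform_(self.emb.weight, -1e-4, 1e-4)  # "small init"
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.norm(self.emb(tokens))

tokens = torch.randint(0, 1000, (2, 16))
print(SmallInitEmbedding(1000, 64)(tokens).shape)  # torch.Size([2, 16, 64])
```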
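Finally, the last entry covers saving JAX-family parameters in the safetensors format. The upstream 🤗 safetensors package ships Flax bindings that store a flat {name: array} dict; a sketch using those (the listed repo may additionally handle nested PyTrees, so the pre-flattened keys here are my simplification):

```python
import jax.numpy as jnp
from safetensors.flax import save_file, load_file

# safetensors stores a flat {name: tensor} mapping; nested PyTrees would
# need flattening into "scope/param"-style keys first.
params = {
    "dense/kernel": jnp.ones((512, 512)),
    "dense/bias": jnp.zeros((512,)),
}
save_file(params, "params.safetensors")
restored = load_file("params.safetensors")
print(restored["dense/bias"].shape)  # (512,)
```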