arogozhnikov / adamw_bfloat16

AdamW optimizer for bfloat16 models in pytorch 🔥.

☆37

Alternatives and similar repositories for adamw_bfloat16

Users who are interested in adamw_bfloat16 are comparing it to the libraries listed below.

sustcsonglin / gated_linear_attention_layer
☆31 Updated last year
lucidrains / token-shift-gpt
Implementation of Token Shift GPT - An autoregressive model that solely relies on shifting the sequence space for mixing
☆50 Updated 3 years ago
EleutherAI / rnngineering
Engineering the state of RNN language models (Mamba, RWKV, etc.)
☆32 Updated last year
drisspg / transformer_nuggets
A place to store reusable transformer components of my own creation or found on the interwebs
☆60 Updated last week
crowsonkb / LDLM
Latent Diffusion Language Models
☆68 Updated 2 years ago
ClashLuke / PerfTorch
High performance pytorch modules
☆18 Updated 2 years ago
lucidrains / rela-transformer
Implementation of a Transformer using ReLA (Rectified Linear Attention) from https://arxiv.org/abs/2104.07012
☆49 Updated 3 years ago
srush / tangent
Source-to-Source Debuggable Derivatives in Pure Python
☆15 Updated last year
lucidrains / memory-editable-transformer
My explorations into editing the knowledge and memories of an attention network
☆34 Updated 2 years ago
google-research / precondition
☆31 Updated 3 months ago
BlinkDL / SmallInitEmb
LayerNorm(SmallInit(Embedding)) in a Transformer to improve convergence
☆58 Updated 3 years ago
ColinQiyangLi / AdaCat
AdaCat
☆49 Updated 3 years ago
codekansas / rwkv
RWKV model implementation
☆38 Updated 2 years ago
ahennequ / pytorch-custom-mma
☆29 Updated 3 years ago
NathanGodey / headless-lm
Training and evaluation code for the paper "Headless Language Models: Learning without Predicting with Contrastive Weight Tying" (https:/…
☆27 Updated last year
amirzandieh / HyperAttention
Triton Implementation of HyperAttention Algorithm
☆48 Updated last year
acosharma / elita-transformer
Official Repository for Efficient Linear-Time Attention Transformers.
☆18 Updated last year
renll / SeqBoat
[NeurIPS 2023] Sparse Modular Activation for Efficient Sequence Modeling
☆39 Updated last year
pytorch / maskedtensor
MaskedTensors for PyTorch
☆38 Updated 3 years ago
lucidrains / product-key-memory
Standalone Product Key Memory module in Pytorch - for augmenting Transformer models
☆83 Updated last year
antofuller / configaformers
A python library for highly configurable transformers - easing model architecture search and experimentation.
☆49 Updated 3 years ago
lucidrains / local-attention-flax
Local Attention - Flax module for Jax
☆22 Updated 4 years ago
HomebrewML / HomebrewNLP-torch
A case study of efficient training of large language models using commodity hardware.
☆68 Updated 3 years ago
jxiw / BiGS
Official Repository of Pretraining Without Attention (BiGS), BiGS is the first model to achieve BERT-level transfer learning on the GLUE …
☆114 Updated last year
jenni-ai / T2FW
Fine-Tuning Pre-trained Transformers into Decaying Fast Weights
☆19 Updated 3 years ago
microsoft / ResiDual
ResiDual: Transformer with Dual Residual Connections, https://arxiv.org/abs/2304.14802
☆95 Updated 2 years ago
proger / nanokitchen
Parallel Associative Scan for Language Models
☆17 Updated last year
samblouir / birdie
☆13 Updated 4 months ago
lucidrains / einops-exts
Implementation of some personal helper functions for Einops, my most favorite tensor manipulation library ❤️
☆55 Updated 2 years ago
lucidrains / autoregressive-linear-attention-cuda
CUDA implementation of autoregressive linear attention, with all the latest research findings
☆44 Updated 2 years ago