martin-marek / batch-sizeLinks

📄Small Batch Size Training for Language Models

☆31

Alternatives and similar repositories for batch-size

Users that are interested in batch-size are comparing it to the libraries listed below

Sorting:

google-deepmind / spectral_ssm
☆32Updated last year
vvvm23 / mamba-jax
Unofficial but Efficient Implementation of "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" in JAX
☆84Updated last year
radarFudan / mamba-minimal-jax
☆31Updated 8 months ago
lucidrains / taylor-series-linear-attention
Explorations into the recently proposed Taylor Series Linear Attention
☆99Updated 11 months ago
shikaiqiu / compute-better-spent
☆53Updated 9 months ago
sustcsonglin / mamba-triton
☆48Updated last year
proger / hippogriff
Griffin MQA + Hawk Linear RNN Hybrid
☆87Updated last year
apple / ml-ademamix
☆64Updated 8 months ago
RobertCsordas / moeut
☆82Updated 11 months ago
edwardmilsom / function-space-learning-rates-paper
Code for the paper "Function-Space Learning Rates"
☆22Updated last month
kvfrans / splus
☆114Updated last month
dvruette / barrel-rec-pytorch
☆53Updated last year
Cranial-XIX / longhorn
Official PyTorch Implementation of the Longhorn Deep State Space Model
☆53Updated 7 months ago
fal-ai-community / nano-mdm
Tiny re-implementation of MDM in style of LLaDA and nano-gpt speedrun
☆55Updated 4 months ago
cloneofsimo / min-max-gpt
Minimal (400 LOC) implementation Maximum (multi-node, FSDP) GPT training
☆129Updated last year
lucidrains / infini-transformer-pytorch
Implementation of Infini-Transformer in Pytorch
☆111Updated 6 months ago
EleutherAI / nanoGPT-mup
The simplest, fastest repository for training/finetuning medium-sized GPTs.
☆149Updated 3 weeks ago
fal-ai-community / NativeSparseAttention
research impl of Native Sparse Attention (2502.11089)
☆58Updated 5 months ago
zhixuan-lin / forgetting-transformer
[ICLR 2025] Official PyTorch implementation of "Forgetting Transformer: Softmax Attention with a Forget Gate"
☆116Updated 2 weeks ago
johnryan465 / pscan
☆40Updated last year
cloneofsimo / min-fsdp
☆81Updated last year
lucidrains / scaling-vin-pytorch
Exploration into the Scaling Value Iteration Networks paper, from Schmidhuber's group
☆36Updated 10 months ago
amirzandieh / HyperAttention
Triton Implementation of HyperAttention Algorithm
☆48Updated last year
tml-epfl / why-weight-decay
Why Do We Need Weight Decay in Modern Deep Learning? [NeurIPS 2024]
☆66Updated 9 months ago
huyphan168 / PEER
Mixture of A Million Experts
☆46Updated 11 months ago
cloneofsimo / scaling-guide
WIP
☆93Updated 11 months ago
BlinkDL / LinearAttentionArena
Here we will test various linear attention designs.
☆62Updated last year
CLAIRE-Labo / flash_attention
A basic pure pytorch implementation of flash attention
☆16Updated 8 months ago
lucidrains / gateloop-transformer
Implementation of GateLoop Transformer in Pytorch and Jax
☆89Updated last year
berlino / seq_icl
☆53Updated last year