martin-marek / batch-size
Small Batch Size Training for Language Models
☆31 · Updated last week
Alternatives and similar repositories for batch-size
Users interested in batch-size are also comparing it to the repositories listed below.
- Unofficial but Efficient Implementation of "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" in JAX · ☆84 · Updated last year
- Explorations into the recently proposed Taylor Series Linear Attention · ☆99 · Updated 11 months ago
- Griffin MQA + Hawk Linear RNN Hybrid · ☆87 · Updated last year
- Code for the paper "Function-Space Learning Rates" · ☆22 · Updated last month
- Official PyTorch Implementation of the Longhorn Deep State Space Model · ☆53 · Updated 7 months ago
- Tiny re-implementation of MDM in the style of LLaDA and the nano-gpt speedrun · ☆55 · Updated 4 months ago
- Minimal (400 LOC) implementation, maximum (multi-node, FSDP) GPT training · ☆129 · Updated last year
- Implementation of Infini-Transformer in Pytorch · ☆111 · Updated 6 months ago
- The simplest, fastest repository for training/finetuning medium-sized GPTs. · ☆149 · Updated 3 weeks ago
- Research implementation of Native Sparse Attention (arXiv:2502.11089) · ☆58 · Updated 5 months ago
- [ICLR 2025] Official PyTorch implementation of "Forgetting Transformer: Softmax Attention with a Forget Gate" · ☆116 · Updated 2 weeks ago
- Exploration into the Scaling Value Iteration Networks paper, from Schmidhuber's group · ☆36 · Updated 10 months ago
- Triton Implementation of HyperAttention Algorithm · ☆48 · Updated last year
- Why Do We Need Weight Decay in Modern Deep Learning? [NeurIPS 2024] · ☆66 · Updated 9 months ago
- Mixture of A Million Experts · ☆46 · Updated 11 months ago
- WIP · ☆93 · Updated 11 months ago
- Here we will test various linear attention designs. · ☆62 · Updated last year
- A basic pure PyTorch implementation of flash attention · ☆16 · Updated 8 months ago
- Implementation of GateLoop Transformer in Pytorch and Jax · ☆89 · Updated last year