martin-marek / batch-size
Small Batch Size Training for Language Models
⭐ 42 · Updated last week
Alternatives and similar repositories for batch-size
Users interested in batch-size are comparing it to the repositories listed below.
- ⭐ 31 · Updated 8 months ago
- ⭐ 56 · Updated 10 months ago
- ⭐ 49 · Updated last year
- ⭐ 33 · Updated last year
- ⭐ 53 · Updated last year
- Unofficial but Efficient Implementation of "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" in JAX · ⭐ 85 · Updated last year
- Minimal but scalable implementation of large language models in JAX · ⭐ 35 · Updated 3 weeks ago
- Code for NeurIPS 2024 Spotlight: "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations" · ⭐ 81 · Updated 9 months ago
- ⭐ 115 · Updated 2 months ago
- ⭐ 83 · Updated last year
- ⭐ 11 · Updated 5 months ago
- ⭐ 53 · Updated last year
- ⭐ 83 · Updated 11 months ago
- Supporting PyTorch FSDP for optimizers · ⭐ 84 · Updated 8 months ago
- Code for the paper "Function-Space Learning Rates" · ⭐ 23 · Updated 2 months ago
- A MAD laboratory to improve AI architecture designs · ⭐ 124 · Updated 7 months ago
- The simplest, fastest repository for training/finetuning medium-sized GPTs · ⭐ 149 · Updated last month
- Custom Triton kernels for training Karpathy's nanoGPT · ⭐ 19 · Updated 9 months ago
- Explorations into the recently proposed Taylor Series Linear Attention · ⭐ 100 · Updated 11 months ago
- Triton implementation of the HyperAttention algorithm · ⭐ 48 · Updated last year
- ⭐ 33 · Updated 9 months ago
- Parallel Associative Scan for Language Models · ⭐ 18 · Updated last year
- ⭐ 65 · Updated 9 months ago
- Official PyTorch implementation of the Longhorn Deep State Space Model · ⭐ 54 · Updated 8 months ago
- Stick-breaking attention · ⭐ 59 · Updated last month
- Minimal (400 LOC) implementation, Maximum (multi-node, FSDP) GPT training · ⭐ 130 · Updated last year
- Griffin MQA + Hawk Linear RNN Hybrid · ⭐ 88 · Updated last year
- ⭐ 40 · Updated 4 months ago
- A basic pure PyTorch implementation of flash attention · ⭐ 16 · Updated 9 months ago
- H-Net Dynamic Hierarchical Architecture · ⭐ 71 · Updated 3 weeks ago