epfml / schedules-and-scaling
☆50 · Updated last week
Related projects
Alternatives and complementary repositories for schedules-and-scaling
- Language models scale reliably with over-training and on downstream tasks · ☆94 · Updated 7 months ago
- Simple and efficient pytorch-native transformer training and inference (batched) · ☆61 · Updated 7 months ago
- Triton Implementation of HyperAttention Algorithm · ☆46 · Updated 10 months ago
- Minimal but scalable implementation of large language models in JAX · ☆25 · Updated last week
- The simplest, fastest repository for training/finetuning medium-sized GPTs · ☆83 · Updated last week
- Using FlexAttention to compute attention with different masking patterns · ☆40 · Updated last month
- Official repository of the paper "RNNs Are Not Transformers (Yet): The Key Bottleneck on In-context Retrieval" · ☆24 · Updated 6 months ago
- Code for the arXiv preprint "The Unreasonable Effectiveness of Easy Training Data" · ☆44 · Updated 9 months ago
- Official repository for the paper "Approximating Two-Layer Feedforward Networks for Efficient Transformers" · ☆36 · Updated 11 months ago
- Stick-breaking attention · ☆32 · Updated last week
- NanoGPT-like codebase for LLM training · ☆73 · Updated this week
- GoldFinch and other hybrid transformer components · ☆39 · Updated 3 months ago
- [ICML 24 NGSM workshop] Associative Recurrent Memory Transformer implementation and scripts for training and evaluation · ☆29 · Updated this week
- Yet another random morning idea to be quickly tried and architecture shared if it works; to allow the transformer to pause for any amount… · ☆49 · Updated last year
- This repo is based on https://github.com/jiaweizzhao/GaLore · ☆18 · Updated last month
- A fusion of a linear layer and a cross-entropy loss, written for PyTorch in Triton · ☆54 · Updated 3 months ago
- A MAD laboratory to improve AI architecture designs 🧪 · ☆95 · Updated 6 months ago
- A framework for few-shot evaluation of autoregressive language models · ☆23 · Updated 10 months ago
- One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation · ☆29 · Updated 3 weeks ago
- [NeurIPS-2024] 📈 Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies https://arxiv.org/abs/2407.13623 · ☆67 · Updated last month