epfml / schedules-and-scaling
Code for NeurIPS 2024 Spotlight: "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations"
★89 · Updated Oct 30, 2024 (last year)
Alternatives and similar repositories for schedules-and-scaling
Users interested in schedules-and-scaling are comparing it to the libraries listed below.
- [NeurIPS-2024] Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies https://arxiv.org/abs/2407.13623 · ★89 · Updated Sep 26, 2024 (last year)
- Checkpointable dataset utilities for foundation model training · ★32 · Updated Jan 29, 2024 (2 years ago)
- ★63 · Updated Oct 3, 2024 (last year)
- An efficient implementation of the NSA (Native Sparse Attention) kernel · ★129 · Updated Jun 24, 2025 (7 months ago)
- Quantized Attention on GPU · ★44 · Updated Nov 22, 2024 (last year)
- JAX Scalify: end-to-end scaled arithmetics · ★18 · Updated Oct 30, 2024 (last year)
- The official repository for SkyLadder: Better and Faster Pretraining via Context Window Scheduling · ★42 · Updated Dec 29, 2025 (last month)
- ★20 · Updated Nov 4, 2025 (3 months ago)
- nanoGPT-like codebase for LLM training · ★113 · Updated Nov 7, 2025 (3 months ago)
- ★33 · Updated Nov 4, 2024 (last year)
- ★14 · Updated Mar 2, 2025 (11 months ago)
- Stick-breaking attention · ★62 · Updated Jul 1, 2025 (7 months ago)
- Estimate MFU for DeepSeekV3 · ★26 · Updated Jan 5, 2025 (last year)
- Source-to-Source Debuggable Derivatives in Pure Python · ★15 · Updated Jan 23, 2024 (2 years ago)
- ★579 · Updated Sep 23, 2025 (4 months ago)
- FlexAttention w/ FlashAttention3 Support · ★27 · Updated Oct 5, 2024 (last year)
- Official GitHub repo for the paper "Compression Represents Intelligence Linearly" [COLM 2024] · ★147 · Updated Sep 20, 2024 (last year)
- [ICLR 2025] "Training LMs on Synthetic Edit Sequences Improves Code Synthesis" (Piterbarg, Pinto, Fergus) · ★19 · Updated Feb 11, 2025 (last year)
- A scalable implementation of diffusion and flow-matching with XGBoost models, applied to calorimeter data. · ★19 · Updated Nov 3, 2024 (last year)
- ★34 · Updated Sep 10, 2024 (last year)
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. · ★92 · Updated Jul 17, 2025 (6 months ago)
- A repository for research on medium-sized language models. · ★77 · Updated May 23, 2024 (last year)
- The official repository of the paper "ScaleLong: Towards More Stable Training of Diffusion Model via Scaling Network Long Skip Connection" (N… · ★50 · Updated Oct 23, 2023 (2 years ago)
- Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning [ICML 2024] · ★21 · Updated May 2, 2024 (last year)
- ★20 · Updated May 30, 2024 (last year)
- Solution of Kaggle competition: Feedback Prize - Evaluating Student Writing · ★16 · Updated Mar 30, 2022 (3 years ago)
- ★250 · Updated Dec 2, 2024 (last year)
- ★34 · Updated May 14, 2025 (9 months ago)
- [NeurIPS 2023] Sparse Modular Activation for Efficient Sequence Modeling · ★40 · Updated Dec 2, 2023 (2 years ago)
- Supporting code for the blog post on modular manifolds. · ★115 · Updated Sep 26, 2025 (4 months ago)
- ★53 · Updated May 20, 2024 (last year)
- This repository contains code for the MicroAdam paper. · ★22 · Updated Dec 14, 2024 (last year)
- ★19 · Updated Dec 4, 2025 (2 months ago)
- Language models scale reliably with over-training and on downstream tasks · ★99 · Updated Apr 2, 2024 (last year)
- ★129 · Updated Jun 6, 2025 (8 months ago)
- ★38 · Updated Feb 8, 2024 (2 years ago)
- ★67 · Updated Mar 21, 2025 (10 months ago)
- Code for Adam-mini: Use Fewer Learning Rates To Gain More https://arxiv.org/abs/2406.16793 · ★452 · Updated May 13, 2025 (9 months ago)
- Code for the paper "Distinguishing the Knowable from the Unknowable with Language Models" · ★11 · Updated Apr 15, 2024 (last year)