An unofficial implementation of "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"
☆36Jun 7, 2024Updated last year
Alternatives and similar repositories for mixture-of-depths
Users who are interested in mixture-of-depths are comparing it to the libraries listed below
- Implementation of the paper: "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"☆115Mar 3, 2026Updated last week
- Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"☆177Jun 20, 2024Updated last year
- Code for the paper "Function-Space Learning Rates"☆25Jun 3, 2025Updated 9 months ago
- ☆16Jul 29, 2025Updated 7 months ago
- ☆15Mar 2, 2025Updated last year
- Combining SOAP and MUON☆19Feb 11, 2025Updated last year
- [Oral; NeurIPS OPT2024] μLO: Compute-Efficient Meta-Generalization of Learned Optimizers☆15Feb 12, 2026Updated 3 weeks ago
- ☆67Mar 21, 2025Updated 11 months ago
- Collection of autoregressive model implementations☆85Feb 23, 2026Updated 2 weeks ago
- PyTorch implementation of StableMask (ICML'24)☆15Jun 27, 2024Updated last year
- Repo hosting code and materials related to speeding up LLM inference using token merging.☆37Oct 9, 2025Updated 5 months ago
- ☆26Nov 13, 2025Updated 3 months ago
- Unofficial implementations of block/layer-wise pruning methods for LLMs.☆78Apr 29, 2024Updated last year
- Training GPTs to solve interaction nets☆18Aug 14, 2024Updated last year
- Official repo of paper LM2☆47Feb 13, 2025Updated last year
- Linear Attention Sequence Parallelism (LASP)☆89Jun 4, 2024Updated last year
- Efficient Infinite Context Transformers with Infini-attention PyTorch Implementation + QwenMoE Implementation + Training Script + 1M cont…☆86May 9, 2024Updated last year
- ☆28Oct 7, 2025Updated 5 months ago
- Official PyTorch code release for Implicit Gradient Transport, NeurIPS'19☆21Jun 11, 2019Updated 6 years ago
- ☆23Jul 7, 2023Updated 2 years ago
- Repo for Rho-1: Token-level Data Selection & Selective Pretraining of LLMs.☆460Apr 18, 2024Updated last year
- Supporting code for the blog post on modular manifolds.☆117Sep 26, 2025Updated 5 months ago
- Masked Structural Growth for 2x Faster Language Model Pre-training☆25Apr 28, 2024Updated last year
- GraphSnapShot: Caching Local Structure for Fast Graph Learning [Efficient ML System]☆40Jan 1, 2026Updated 2 months ago
- Unified Graph Transformer (UGT) is a novel Graph Transformer model specialised in preserving both local and global graph structures and d…☆28Jul 17, 2025Updated 7 months ago
- Official Repo for Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics☆71Jan 13, 2026Updated last month
- Modeling code for a BitNet b1.58 Llama-style model.☆25Apr 30, 2024Updated last year
- Efficient Foundation Model Design: A Perspective From Model and System Co-Design [Efficient ML System & Model]☆29Feb 23, 2025Updated last year
- Official codebase for "The Generalization Gap in Offline Reinforcement Learning" accepted to ICLR 2024☆28Feb 20, 2026Updated 2 weeks ago
- A big_vision-inspired repo that implements a generic Auto-Encoder class capable of representation learning and generative modeling.☆34Jun 26, 2024Updated last year
- Bytecode manipulation in runtime, true shared memory, async LMDB, async Tkinter, async wxPython, async PySide, async PyQt, async loop wit…☆31Nov 25, 2024Updated last year
- The official repository for SkyLadder: Better and Faster Pretraining via Context Window Scheduling☆42Dec 29, 2025Updated 2 months ago
- manipulating cointegrated pairs to achieve a market-neutral strategy that outperforms indices☆12Jan 12, 2021Updated 5 years ago
- Research without Re-search: Maximal Update Parametrization Yields Accurate Loss Prediction across Scales☆32Jul 17, 2023Updated 2 years ago
- Code for NeurIPS 2024 Spotlight: "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations"☆92Oct 30, 2024Updated last year
- some common Huggingface transformers in maximal update parametrization (µP)☆87Mar 14, 2022Updated 3 years ago
- Official code for "Visual Attention Emerges from Recurrent Sparse Reconstruction" (ICML 2022)☆36Jul 5, 2022Updated 3 years ago
- Set of scripts to finetune LLMs☆38Mar 30, 2024Updated last year
- ☆34Mar 12, 2025Updated 11 months ago