sramshetty / mixture-of-depths
An unofficial implementation of "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"
☆36Jun 7, 2024Updated last year
Alternatives and similar repositories for mixture-of-depths
Users that are interested in mixture-of-depths are comparing it to the libraries listed below
- Implementation of the paper: "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"☆114Feb 10, 2026Updated last week
- Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"☆177Jun 20, 2024Updated last year
- Code for the paper "Function-Space Learning Rates"☆25Jun 3, 2025Updated 8 months ago
- Scratchpad/Chain-of-Thought Prompts☆12Jun 6, 2022Updated 3 years ago
- ☆16Jul 29, 2025Updated 6 months ago
- Combining SOAP and MUON☆19Feb 11, 2025Updated last year
- My fork of Allen AI's OLMo for educational purposes.☆28Dec 5, 2024Updated last year
- [Oral; NeurIPS OPT2024] μLO: Compute-Efficient Meta-Generalization of Learned Optimizers☆14Mar 18, 2025Updated 10 months ago
- ☆67Mar 21, 2025Updated 10 months ago
- Collection of autoregressive model implementations☆85Feb 10, 2026Updated last week
- PyTorch implementation of StableMask (ICML'24)☆15Jun 27, 2024Updated last year
- Accelerating Multitask Training Through Adaptive Transition [Efficient ML Model]☆12May 23, 2025Updated 8 months ago
- ☆25Nov 13, 2025Updated 3 months ago
- Repository for the introductory course on reinforcement learning.☆15Feb 2, 2025Updated last year
- Unofficial implementations of block/layer-wise pruning methods for LLMs.☆77Apr 29, 2024Updated last year
- Training GPTs to solve interaction nets☆18Aug 14, 2024Updated last year
- Official repo of paper LM2☆47Feb 13, 2025Updated last year
- Community Implementation of the paper: "Multi-Head Mixture-of-Experts" In PyTorch☆29Jan 31, 2026Updated 2 weeks ago
- Linear Attention Sequence Parallelism (LASP)☆88Jun 4, 2024Updated last year
- Official PyTorch code release for Implicit Gradient Transport, NeurIPS'19☆21Jun 11, 2019Updated 6 years ago
- ☆28Oct 7, 2025Updated 4 months ago
- Supporting code for the blog post on modular manifolds.☆115Sep 26, 2025Updated 4 months ago
- GraphSnapShot: Caching Local Structure for Fast Graph Learning [Efficient ML System]☆40Jan 1, 2026Updated last month
- Masked Structural Growth for 2x Faster Language Model Pre-training☆25Apr 28, 2024Updated last year
- Unified Graph Transformer (UGT) is a novel Graph Transformer model specialised in preserving both local and global graph structures and d…☆28Jul 17, 2025Updated 7 months ago
- Official Repo for Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics☆71Jan 13, 2026Updated last month
- Modeling code for a BitNet b1.58 Llama-style model.☆25Apr 30, 2024Updated last year
- manipulating cointegrated pairs to achieve a market-neutral strategy that outperforms indices☆12Jan 12, 2021Updated 5 years ago
- Bytecode manipulation in runtime, true shared memory, async LMDB, async Tkinter, async wxPython, async PySide, async PyQt, async loop wit…☆31Nov 25, 2024Updated last year
- The official repository for SkyLadder: Better and Faster Pretraining via Context Window Scheduling☆42Dec 29, 2025Updated last month
- [TMLR 2025 & ICLR 2025 DeLTa] Official Implementation of Design Editing for Offline Model-based Optimization 🧬 🤖☆10Apr 17, 2025Updated 10 months ago
- Research without Re-search: Maximal Update Parametrization Yields Accurate Loss Prediction across Scales☆32Jul 17, 2023Updated 2 years ago
- Code for NeurIPS 2024 Spotlight: "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations"☆89Oct 30, 2024Updated last year
- Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence☆199Updated this week
- some common Huggingface transformers in maximal update parametrization (µP)☆87Mar 14, 2022Updated 3 years ago
- Large language models (LLMs) made easy, EasyLM is a one stop solution for pre-training, finetuning, evaluating and serving LLMs in JAX/Fl…☆78Aug 17, 2024Updated last year
- A Deepfake detector based on a hybrid EfficientNet CNN and Vision Transformer architecture. The model is explainable by rendering a heatma…☆15Mar 16, 2022Updated 3 years ago
- Official code for "Visual Attention Emerges from Recurrent Sparse Reconstruction" (ICML 2022)☆36Jul 5, 2022Updated 3 years ago
- ☆35Mar 12, 2025Updated 11 months ago