lucidrains / st-moe-pytorch
Implementation of ST-MoE, the latest incarnation of MoE after years of research at Brain, in Pytorch
⭐ 328 · Updated 10 months ago
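For orientation, here is a minimal usage sketch of the library this page centers on. The `MoE` class, its constructor arguments, and the returned auxiliary losses are assumptions based on the repo's README, not a verified API; check the source for exact signatures.

```python
import torch
from st_moe_pytorch import MoE  # assumed import path, per the repo README

# A sparse mixture-of-experts layer: each token is routed to a small subset of experts,
# with the auxiliary losses (balance loss, router z-loss) proposed in the ST-MoE paper.
moe = MoE(
    dim = 512,          # token / model dimension
    num_experts = 16,   # more experts adds parameters at roughly constant compute per token
    gating_top_n = 2,   # top-2 routing as in the paper (argument name is an assumption)
)

tokens = torch.randn(2, 1024, 512)  # (batch, sequence, dim)

# return signature assumed from the README
out, total_aux_loss, balance_loss, router_z_loss = moe(tokens)

# during training the auxiliary loss would be added to the task loss, e.g.
# loss = task_loss + total_aux_loss
```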
Alternatives and similar repositories for st-moe-pytorch:
Users interested in st-moe-pytorch are comparing it to the libraries listed below.
- Implementation of Soft MoE, proposed by Brain's Vision team, in Pytorch ⭐ 283 · Updated 3 weeks ago
- A Pytorch implementation of Sparsely-Gated Mixture of Experts, for massively increasing the parameter count of language models ⭐ 727 · Updated last year
- Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch ⭐ 511 · Updated 5 months ago
- Implementation of Recurrent Memory Transformer, NeurIPS 2022 paper, in Pytorch ⭐ 407 · Updated 3 months ago
- Some preliminary explorations of Mamba's context scaling. ⭐ 212 · Updated last year
- Implementation of a memory efficient multi-head attention as proposed in the paper, "Self-attention Does Not Need O(n²) Memory" ⭐ 376 · Updated last year
- ⭐ 184 · Updated this week
- Large Context Attention ⭐ 704 · Updated 2 months ago
- Understand and test language model architectures on synthetic tasks. ⭐ 192 · Updated last month
- Official implementation of TransNormerLLM: A Faster and Better LLM ⭐ 243 · Updated last year
- Annotated version of the Mamba paper ⭐ 483 · Updated last year
- [ICLR2025 Spotlight🔥] Official Implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters ⭐ 548 · Updated 2 months ago
- ⭐ 621 · Updated last week
- Repo for "Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture" ⭐ 549 · Updated 3 months ago
- Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff" ⭐ 232 · Updated 2 months ago
- Implementation of the conditionally routed attention in the CoLT5 architecture, in Pytorch ⭐ 226 · Updated 7 months ago
- Helpful tools and examples for working with flex-attention ⭐ 726 · Updated last week
- ⭐ 290 · Updated 4 months ago
- Code for Adam-mini: Use Fewer Learning Rates To Gain More https://arxiv.org/abs/2406.16793 ⭐ 405 · Updated last week
- ⭐ 219 · Updated 10 months ago
- Recurrent Memory Transformer ⭐ 149 · Updated last year
- Experiments around a simple idea for inducing multiple hierarchical predictive models within a GPT ⭐ 211 · Updated 8 months ago
- ⭐ 143 · Updated last year
- Implementation of Rotary Embeddings, from the Roformer paper, in Pytorch ⭐ 666 · Updated 4 months ago
- Implementation of Block Recurrent Transformer - Pytorch ⭐ 217 · Updated 8 months ago
- ⭐ 255 · Updated last year
- (Unofficial) PyTorch implementation of grouped-query attention (GQA) from "GQA: Training Generalized Multi-Query Transformer Models from … ⭐ 162 · Updated 11 months ago
- Official JAX implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States ⭐ 405 · Updated 8 months ago
- Muon optimizer: +>30% sample efficiency with <3% wallclock overhead ⭐ 575 · Updated 3 weeks ago
- Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ⭐ 158 · Updated 10 months ago