lucidrains / st-moe-pytorch
Implementation of ST-MoE, the latest incarnation of MoE after years of research at Brain, in Pytorch
★297 · Updated 7 months ago
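For orientation, here is a minimal sketch of the top-2 expert routing idea that sparse MoE layers such as ST-MoE build on. This is a generic illustration, not the st-moe-pytorch API: the `TinyTop2MoE` class and its arguments are invented for this example, and it omits the auxiliary load-balancing and router z-losses that ST-MoE adds.

```python
# Generic top-2 expert routing sketch (NOT the st-moe-pytorch API).
import torch
import torch.nn as nn

class TinyTop2MoE(nn.Module):
    def __init__(self, dim, num_experts=4, hidden_mult=2):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts, bias=False)  # router producing expert logits
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(dim, dim * hidden_mult),
                nn.GELU(),
                nn.Linear(dim * hidden_mult, dim),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x):                                     # x: (batch, seq, dim)
        logits = self.gate(x)                                 # (batch, seq, num_experts)
        weights, indices = logits.softmax(dim=-1).topk(2, dim=-1)  # pick top-2 experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)      # renormalize the two gate weights
        out = torch.zeros_like(x)
        for slot in range(2):                                 # combine the two chosen experts per token
            for e, expert in enumerate(self.experts):
                mask = indices[..., slot] == e                # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

x = torch.randn(2, 8, 64)
print(TinyTop2MoE(dim=64)(x).shape)  # torch.Size([2, 8, 64])
```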
Alternatives and similar repositories for st-moe-pytorch:
Users interested in st-moe-pytorch are comparing it to the libraries listed below.
- Implementation of Soft MoE, proposed by Brain's Vision team, in Pytorch · ★256 · Updated 8 months ago
- Implementation of Ring Attention, from Liu et al. at Berkeley AI, in Pytorch · ★492 · Updated 2 months ago
- A Pytorch implementation of Sparsely-Gated Mixture of Experts, for massively increasing the parameter count of language models · ★668 · Updated last year
- Implementation of Recurrent Memory Transformer, NeurIPS 2022 paper, in Pytorch · ★403 · Updated last week
- Large Context Attention · ★670 · Updated 5 months ago
- Some preliminary explorations of Mamba's context scaling · ★206 · Updated 11 months ago
- Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff" · ★219 · Updated last month
- Understand and test language model architectures on synthetic tasks · ★175 · Updated this week
- Experiments around a simple idea for inducing multiple hierarchical predictive models within a GPT · ★205 · Updated 4 months ago
- Annotated version of the Mamba paper · ★469 · Updated 10 months ago
- Explorations into some recent techniques surrounding speculative decoding · ★229 · Updated 3 weeks ago
- Official PyTorch implementation of QA-LoRA · ★122 · Updated 10 months ago
- Official implementation of TransNormerLLM: A Faster and Better LLM · ★233 · Updated 11 months ago
- PyTorch implementation of Infini-Transformer from "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention… · ★286 · Updated 8 months ago
- Scaling Data-Constrained Language Models · ★330 · Updated 3 months ago
- Repo for "Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture" · ★541 · Updated 2 weeks ago
- Repo for Rho-1: Token-level Data Selection & Selective Pretraining of LLMs · ★382 · Updated 9 months ago
- Helpful tools and examples for working with flex-attention · ★583 · Updated this week
- The Truth Is In There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction · ★377 · Updated 6 months ago
- Recurrent Memory Transformer · ★148 · Updated last year
- Triton-based implementation of Sparse Mixture of Experts · ★192 · Updated last month
- Official implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters · ★477 · Updated this week
- Implementation of Infini-Transformer in Pytorch · ★107 · Updated 2 weeks ago
- Sequence modeling with Mega · ★297 · Updated last year
- Pytorch implementation of the PEER block from the paper, Mixture of A Million Experts, by Xu Owen He at DeepMind · ★115 · Updated 4 months ago