lucidrains / PEER-pytorch
PyTorch implementation of the PEER block from the paper "Mixture of A Million Experts" by Xu Owen He at DeepMind.
☆115 · Updated 5 months ago
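For orientation, here is a minimal single-head sketch of the PEER idea the repository implements: product-key retrieval over a very large pool of single-neuron (rank-1) experts. This is an illustrative reconstruction of the mechanism from the paper, not the repository's actual API; the class name `PEERSketch` and all hyperparameter choices are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F
from torch import nn

class PEERSketch(nn.Module):
    """Single-head PEER-style layer (illustrative, not the repo's API).

    Experts live on an nk x nk grid; a query is split in half and scored
    against two small sub-key tables, so top-k retrieval over num_experts
    keys costs only O(sqrt(num_experts)) score computations per half.
    """
    def __init__(self, dim, num_experts=256 ** 2, topk=16):
        super().__init__()
        self.nk = int(num_experts ** 0.5)       # side of the expert grid
        assert self.nk ** 2 == num_experts, "num_experts must be a square"
        assert dim % 2 == 0, "dim must be even to split the query"
        self.topk = topk
        self.query = nn.Linear(dim, dim)
        # two sub-key tables; a full product key pairs one row from each
        self.keys = nn.Parameter(torch.randn(2, self.nk, dim // 2))
        # each expert is a rank-1 MLP: down vector d_i and up vector u_i
        self.down = nn.Embedding(num_experts, dim)
        self.up = nn.Embedding(num_experts, dim)

    def forward(self, x):                        # x: (batch, dim)
        q = self.query(x)
        q1, q2 = q.chunk(2, dim=-1)              # split query for product keys
        s1 = q1 @ self.keys[0].t()               # (batch, nk) sub-key scores
        s2 = q2 @ self.keys[1].t()
        v1, i1 = s1.topk(self.topk, dim=-1)      # top-k within each sub-key set
        v2, i2 = s2.topk(self.topk, dim=-1)
        # combine the two top-k sets into k*k candidates, then re-rank globally
        scores = (v1.unsqueeze(-1) + v2.unsqueeze(-2)).flatten(1)     # (batch, k*k)
        idx = (i1.unsqueeze(-1) * self.nk + i2.unsqueeze(-2)).flatten(1)
        top, pos = scores.topk(self.topk, dim=-1)
        expert_idx = idx.gather(-1, pos)         # (batch, topk) expert ids
        w = top.softmax(dim=-1)                  # routing weights
        d = self.down(expert_idx)                # (batch, topk, dim)
        u = self.up(expert_idx)
        h = F.gelu(torch.einsum('bd,bkd->bk', x, d))  # each expert's scalar hidden
        return torch.einsum('bk,bk,bkd->bd', w, h, u)
```

Because each expert contributes a single hidden unit, the layer scales the expert count to millions while activating only `topk` of them per token; the actual repository adds multiple heads, normalization, and batched key scoring on top of this core routine.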
Alternatives and similar repositories for PEER-pytorch:
Users interested in PEER-pytorch are also comparing it to the repositories listed below.
- Mixture of A Million Experts ☆33 · Updated 5 months ago
- When it comes to optimizers, it's always better to be safe than sorry ☆166 · Updated this week
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆113 · Updated last month
- Implementation of 🥥 Coconut, Chain of Continuous Thought, in Pytorch ☆150 · Updated 3 weeks ago
- Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff" ☆219 · Updated last month
- Griffin MQA + Hawk Linear RNN Hybrid ☆85 · Updated 9 months ago
- Understand and test language model architectures on synthetic tasks. ☆177 · Updated 2 weeks ago
- Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention" ☆96 · Updated 3 months ago
- Normalized Transformer (nGPT) ☆146 · Updated 2 months ago
- This repo is based on https://github.com/jiaweizzhao/GaLore ☆23 · Updated 4 months ago
- Implementation of Infini-Transformer in Pytorch ☆109 · Updated 3 weeks ago
- Some preliminary explorations of Mamba's context scaling. ☆209 · Updated 11 months ago
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024) ☆149 · Updated last month
- ☆70 · Updated 5 months ago
- Explorations into the recently proposed Taylor Series Linear Attention ☆92 · Updated 5 months ago
- Token Omission Via Attention ☆122 · Updated 3 months ago
- The simplest, fastest repository for training/finetuning medium-sized GPTs. ☆91 · Updated 2 months ago
- Triton Implementation of HyperAttention Algorithm ☆46 · Updated last year
- ☆74 · Updated last year
- ☆136 · Updated last year
- ☆180 · Updated this week
- [ICML'24 Oral] The official code of "DiJiang: Efficient Large Language Models through Compact Kernelization", a novel DCT-based linear at… ☆99 · Updated 7 months ago
- Minimal (400 LOC) implementation of maximum (multi-node, FSDP) GPT training ☆121 · Updated 9 months ago
- ☆78 · Updated 9 months ago
- ☆80 · Updated 4 months ago
- ☆75 · Updated 6 months ago
- ☆85 · Updated 8 months ago
- Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule ☆112 · Updated 3 weeks ago
- ☆66 · Updated 6 months ago
- Here we will test various linear attention designs. ☆58 · Updated 9 months ago