lucidrains / mixture-of-attention
Some personal experiments around routing tokens to different autoregressive attention, akin to mixture-of-experts
☆123 · Oct 17, 2024 · Updated last year
Alternatives and similar repositories for mixture-of-attention
Users who are interested in mixture-of-attention are comparing it to the libraries listed below
- Implementation of an Attention layer where each head can attend to more than just one token, using coordinate descent to pick topk ☆47 · Jul 16, 2023 · Updated 2 years ago
- Yet another random morning idea to be quickly tried and architecture shared if it works; to allow the transformer to pause for any amount… ☆53 · Oct 22, 2023 · Updated 2 years ago
- Experiments around a simple idea for inducing multiple hierarchical predictive models within a GPT ☆224 · Aug 20, 2024 · Updated last year
- CUDA implementation of autoregressive linear attention, with all the latest research findings ☆46 · May 23, 2023 · Updated 2 years ago
- Implementation of fused cosine similarity attention in the same style as Flash Attention ☆220 · Feb 13, 2023 · Updated 3 years ago
- Implementation of the Kalman Filtering Attention proposed in "Kalman Filtering Attention for User Behavior Modeling in CTR Prediction" ☆59 · Oct 22, 2023 · Updated 2 years ago
- Explorations into the recently proposed Taylor Series Linear Attention ☆100 · Aug 18, 2024 · Updated last year
- Implementation of Soft MoE, proposed by Brain's Vision team, in Pytorch ☆344 · Apr 2, 2025 · Updated 10 months ago
- Implementation of Block Recurrent Transformer - Pytorch ☆224 · Aug 20, 2024 · Updated last year
- Utilities for PyTorch distributed ☆25 · Feb 27, 2025 · Updated 11 months ago
- Implementation of the Llama architecture with RLHF + Q-learning ☆170 · Feb 1, 2025 · Updated last year
- Implementation of Recurrent Interface Network (RIN), for highly efficient generation of images and video without cascading networks, in P… ☆207 · Feb 14, 2024 · Updated 2 years ago
- [ICML 2025] LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models ☆17 · Nov 4, 2025 · Updated 3 months ago
- Implementation of a holodeck, written in Pytorch ☆18 · Nov 1, 2023 · Updated 2 years ago
- Local Attention - Flax module for Jax ☆22 · May 26, 2021 · Updated 4 years ago
- Demonstration that finetuning a RoPE model on larger sequences than the pre-trained model adapts the model context limit ☆63 · Jun 21, 2023 · Updated 2 years ago
- Implementation of Flash Attention in Jax ☆225 · Mar 1, 2024 · Updated last year
- [Findings of NAACL2022] A Dog Is Passing Over The Jet? A Text-Generation Dataset for Korean Commonsense Reasoning and Evaluation ☆28 · Dec 9, 2022 · Updated 3 years ago
- Implementation of Discrete Key / Value Bottleneck, in Pytorch ☆88 · Jul 9, 2023 · Updated 2 years ago
- Implementation of Token Shift GPT - An autoregressive model that solely relies on shifting the sequence space for mixing ☆49 · Jan 27, 2022 · Updated 4 years ago
- Implementation of Mega, the Single-head Attention with Multi-headed EMA architecture that currently holds SOTA on Long Range Arena ☆207 · Aug 26, 2023 · Updated 2 years ago
- A friendly, hands-on deep learning course ☆12 · Sep 22, 2020 · Updated 5 years ago
- Fine-Tuning Pre-trained Transformers into Decaying Fast Weights ☆19 · Oct 9, 2022 · Updated 3 years ago
- A Pytorch implementation of Sparsely-Gated Mixture of Experts, for massively increasing the parameter count of language models ☆848 · Sep 13, 2023 · Updated 2 years ago
- Implementation of Denoising Diffusion for protein design, but using the new Equiformer (successor to SE3 Transformers) with some addition… ☆57 · Dec 27, 2022 · Updated 3 years ago
- Implementation of the transformer proposed in "Building Blocks for a Complex-Valued Transformer Architecture" ☆88 · Oct 13, 2023 · Updated 2 years ago
- Korean T5 model ☆54 · Dec 7, 2021 · Updated 4 years ago
- ☆11 · Nov 23, 2021 · Updated 4 years ago
- The first spoken long-text dataset derived from live streams, designed to reflect the redundancy-rich and conversational nature of real-w… ☆12 · Jun 28, 2025 · Updated 7 months ago
- Exploring finetuning public checkpoints on filtered 8K sequences on the Pile ☆115 · Mar 22, 2023 · Updated 2 years ago
- "Why do I feel offended?" - Korean Dataset for Offensive Language Identification (EACL2023 Findings) ☆15 · May 14, 2023 · Updated 2 years ago
- Repo for "Zemi: Learning Zero-Shot Semi-Parametric Language Models from Multiple Tasks" ACL 2023 Findings ☆15 · May 3, 2023 · Updated 2 years ago
- My attempts at applying Soundstream design on learned tokenization of text and then applying hierarchical attention to text generation ☆90 · Oct 11, 2024 · Updated last year
- Implementation of the proposed Adam-atan2 from Google Deepmind in Pytorch ☆135 · Oct 15, 2025 · Updated 4 months ago
- This is a simple torch implementation of the high performance Multi-Query Attention ☆16 · Aug 23, 2023 · Updated 2 years ago
- Data consisting only of conversational example sentences extracted from a dictionary ☆16 · Apr 24, 2023 · Updated 2 years ago
- Beyond LM: How can language models go forward in the future? ☆15 · Apr 30, 2023 · Updated 2 years ago
- Implementation of Rotary Embeddings, from the Roformer paper, in Pytorch ☆804 · Jan 30, 2026 · Updated 2 weeks ago
- Learning to Model Editing Processes ☆26 · Aug 3, 2025 · Updated 6 months ago