lucidrains / pause-transformer
Yet another random morning idea to be quickly tried, with the architecture shared if it works: allow the transformer to pause for any amount of time on any token
☆53 · Updated last year
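As a rough illustration of the idea described above, here is a minimal sketch of a pause-token transformer in PyTorch. This is not the repository's actual API: the class name `PausedTransformer`, the `num_pauses` argument, and the interleaving scheme are all assumptions made for the example.

```python
# Minimal sketch of the pause-token idea, NOT the repo's actual implementation.
# A few learnable "pause" embeddings are interleaved after every input token,
# giving the transformer extra computation steps per position. Class and
# argument names (PausedTransformer, num_pauses) are hypothetical; positional
# encoding and causal masking are omitted for brevity.

import torch
from torch import nn

class PausedTransformer(nn.Module):
    def __init__(self, vocab_size, dim=256, depth=4, heads=4, num_pauses=2):
        super().__init__()
        self.num_pauses = num_pauses
        self.token_emb = nn.Embedding(vocab_size, dim)
        # learnable pause embeddings, shared across all positions
        self.pause_emb = nn.Parameter(torch.randn(num_pauses, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.to_logits = nn.Linear(dim, vocab_size)

    def forward(self, ids):                              # ids: (batch, seq)
        b, n = ids.shape
        tokens = self.token_emb(ids).unsqueeze(2)        # (b, n, 1, d)
        pauses = self.pause_emb.expand(b, n, -1, -1)     # (b, n, p, d)
        x = torch.cat([tokens, pauses], dim=2)           # (b, n, 1 + p, d)
        x = x.reshape(b, n * (1 + self.num_pauses), -1)  # interleaved sequence
        x = self.encoder(x)
        # read predictions only at the original (non-pause) positions
        x = x.reshape(b, n, 1 + self.num_pauses, -1)[:, :, 0]
        return self.to_logits(x)                         # (b, n, vocab)
```

In practice one would add positional embeddings, a causal mask that also covers the pause positions, and train with next-token prediction read off the non-pause positions. Note that this fixed `num_pauses` sketch does not capture the per-token, variable-length pausing that the repository description refers to.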
Alternatives and similar repositories for pause-transformer:
Users interested in pause-transformer are comparing it to the repositories listed below.
- ☆51 · Updated 9 months ago
- Official repository for the paper "Approximating Two-Layer Feedforward Networks for Efficient Transformers" ☆36 · Updated last year
- Code for the NeurIPS 2024 Spotlight "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations" ☆70 · Updated 3 months ago
- ☆71 · Updated 6 months ago
- Explorations into the recently proposed Taylor Series Linear Attention ☆92 · Updated 6 months ago
- Implementation of Infini-Transformer in PyTorch ☆109 · Updated last month
- Implementation of GateLoop Transformer in PyTorch and Jax ☆87 · Updated 8 months ago
- Exploration into the proposed "Self Reasoning Tokens" by Felipe Bonetto ☆55 · Updated 9 months ago
- ☆78 · Updated 10 months ago
- PyTorch implementation of the PEER block from the paper "Mixture of A Million Experts" by Xu Owen He at DeepMind ☆118 · Updated 5 months ago
- ☆37 · Updated 10 months ago
- ☆47 · Updated last year
- Official repository of the paper "RNNs Are Not Transformers (Yet): The Key Bottleneck on In-context Retrieval" ☆25 · Updated 10 months ago
- Code for the paper "The Impact of Positional Encoding on Length Generalization in Transformers" (NeurIPS 2023) ☆130 · Updated 9 months ago
- ☆52 · Updated 4 months ago
- ☆33 · Updated 5 months ago
- Here we will test various linear attention designs. ☆58 · Updated 9 months ago
- Some personal experiments around routing tokens to different autoregressive attention, akin to mixture-of-experts ☆115 · Updated 4 months ago
- This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity… ☆23 · Updated 11 months ago
- ☆42 · Updated last year
- Randomized Positional Encodings Boost Length Generalization of Transformers ☆79 · Updated 11 months ago
- Understand and test language model architectures on synthetic tasks. ☆181 · Updated last month
- Why Do We Need Weight Decay in Modern Deep Learning? [NeurIPS 2024] ☆60 · Updated 4 months ago
- Minimal (400 LOC) implementation, maximum (multi-node, FSDP) GPT training ☆122 · Updated 10 months ago
- nanoGPT-like codebase for LLM training ☆89 · Updated this week
- Language models scale reliably with over-training and on downstream tasks ☆96 · Updated 10 months ago
- ☆75 · Updated 7 months ago
- ☆77 · Updated last year
- ☆44 · Updated last year
- ☆66 · Updated 7 months ago