conceptofmind / t5-pytorch
Implementation of "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" in PyTorch.
☆51 · Updated last year
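The paper casts every task (translation, summarization, classification, and so on) as mapping an input string to an output string, with a short task prefix selecting the task. The sketch below illustrates that text-to-text interface using a pretrained T5 checkpoint from the Hugging Face `transformers` library; it is not this repository's own API, which is not shown on this page.

```python
# Minimal sketch of T5's text-to-text framing: every task is "text in, text out",
# and only the task prefix on the input changes.
# Uses the Hugging Face `transformers` T5 checkpoint, not conceptofmind/t5-pytorch.
import torch
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Translation, summarization, classification, etc. all share this interface.
inputs = tokenizer(
    "translate English to German: The house is wonderful.",
    return_tensors="pt",
)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# e.g. "Das Haus ist wunderbar."
```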
Alternatives and similar repositories for t5-pytorch
Users interested in t5-pytorch are comparing it to the repositories listed below.
- PyTorch implementation of the PEER block from the paper "Mixture of A Million Experts" by Xu Owen He at DeepMind ☆129 · Updated last year
- Code for the paper "The Impact of Positional Encoding on Length Generalization in Transformers", NeurIPS 2023 ☆135 · Updated last year
- ☆85 · Updated last year
- Implementation of Infini-Transformer in PyTorch ☆113 · Updated 9 months ago
- ☆107 · Updated 2 years ago
- [ICLR 2023] "Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers" by Tianlong Chen*, Zhenyu Zhang*, Ajay Jaiswal… ☆55 · Updated 2 years ago
- Language models scale reliably with over-training and on downstream tasks ☆100 · Updated last year
- Some personal experiments around routing tokens to different autoregressive attention, akin to mixture-of-experts ☆119 · Updated 11 months ago
- Yet another random morning idea to be quickly tried and architecture shared if it works; to allow the transformer to pause for any amount… ☆52 · Updated last year
- ☆106 · Updated last year
- ☆101 · Updated last year
- Implementation of CALM from the paper "LLM Augmented LLMs: Expanding Capabilities through Composition", out of Google DeepMind ☆177 · Updated last year
- Token Omission Via Attention ☆128 · Updated 11 months ago
- [NAACL 2025] A Closer Look into Mixture-of-Experts in Large Language Models ☆55 · Updated 8 months ago
- Griffin MQA + Hawk Linear RNN Hybrid ☆89 · Updated last year
- Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention" ☆99 · Updated last year
- Implementation of 🥥 Coconut, Chain of Continuous Thought, in PyTorch ☆180 · Updated 3 months ago
- Here we will test various linear attention designs. ☆61 · Updated last year
- ☆95 · Updated last year
- One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation ☆43 · Updated 11 months ago
- Mixture of A Million Experts ☆48 · Updated last year
- Implementation of the paper "AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning" (https://arxiv.org/abs/2205.1…) ☆134 · Updated 2 years ago
- Replicating O1 inference-time scaling laws ☆90 · Updated 10 months ago
- CUDA and Triton implementations of Flash Attention with SoftmaxN. ☆73 · Updated last year
- ☆151 · Updated 10 months ago
- Randomized Positional Encodings Boost Length Generalization of Transformers ☆82 · Updated last year
- ☆52 · Updated last year
- Implementation of the conditionally routed attention in the CoLT5 architecture, in PyTorch ☆230 · Updated last year
- 32 times longer context window than vanilla Transformers and up to 4 times longer than memory efficient Transformers. ☆48 · Updated 2 years ago
- Code for PHATGOOSE introduced in "Learning to Route Among Specialized Experts for Zero-Shot Generalization" ☆90 · Updated last year