abhishekpanigrahi1996 / transformer_in_transformer
☆45 · Updated last year
Alternatives and similar repositories for transformer_in_transformer:
Users interested in transformer_in_transformer are comparing it to the libraries listed below.
- ☆51 · Updated 11 months ago
- Official repository of the paper "RNNs Are Not Transformers (Yet): The Key Bottleneck on In-context Retrieval" ☆27 · Updated last year
- ☆47 · Updated last year
- Official repository for the paper "Approximating Two-Layer Feedforward Networks for Efficient Transformers" ☆37 · Updated last year
- Yet another random morning idea to be quickly tried and architecture shared if it works; to allow the transformer to pause for any amount… ☆53 · Updated last year
- Reference implementation for "Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model" ☆44 · Updated last year
- ☆32 · Updated last year
- Official code repository for the paper "Great Memory, Shallow Reasoning: Limits of kNN-LMs" ☆23 · Updated this week
- Repository for NPHardEval, a quantified-dynamic benchmark of LLMs ☆54 · Updated last year
- Self-Supervised Alignment with Mutual Information ☆18 · Updated 11 months ago
- Code for the NeurIPS 2024 Spotlight "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations" ☆72 · Updated 6 months ago
- Exploration of automated dataset selection approaches at large scales ☆39 · Updated 2 months ago
- [NeurIPS 2023] Sparse Modular Activation for Efficient Sequence Modeling ☆36 · Updated last year
- ☆31 · Updated last year
- ☆20 · Updated 11 months ago
- ☆16 · Updated this week
- Blog post ☆17 · Updated last year
- Code for the ACL 2023 paper "Grokking of Hierarchical Structure in Vanilla Transformers" ☆21 · Updated last year
- ☆11 · Updated 11 months ago
- Code for the paper "The Impact of Positional Encoding on Length Generalization in Transformers" (NeurIPS 2023) ☆135 · Updated last year
- Revisiting Efficient Training Algorithms For Transformer-based Language Models (NeurIPS 2023) ☆80 · Updated last year
- ☆39 · Updated 2 years ago
- ☆28 · Updated last year
- ☆33 · Updated last year
- Stick-breaking attention ☆52 · Updated last month
- Code for the arXiv preprint "The Unreasonable Effectiveness of Easy Training Data" ☆47 · Updated last year
- Sparse Backpropagation for Mixture-of-Expert Training ☆29 · Updated 10 months ago
- Code for Adaptive Data Optimization ☆24 · Updated 4 months ago
- [ACL 2023] Training Trajectories of Language Models Across Scales https://arxiv.org/pdf/2212.09803.pdf ☆23 · Updated last year
- ☆54 · Updated last year