abhishekpanigrahi1996 / transformer_in_transformer

☆45

Alternatives and similar repositories for transformer_in_transformer

Users who are interested in transformer_in_transformer are comparing it to the repositories listed below.

epfml / schedules-and-scaling
Code for NeurIPS 2024 Spotlight: "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations"
☆85 Updated last year
berlino / seq_icl
☆53 Updated last year
McGill-NLP / length-generalization
Code for the paper "The Impact of Positional Encoding on Length Generalization in Transformers", NeurIPS 2023
☆138 Updated last year
dangxingyu / rnn-icrag
Official repository of paper "RNNs Are Not Transformers (Yet): The Key Bottleneck on In-context Retrieval"
☆27 Updated last year
r-three / RAD
Reference implementation for Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model
☆45 Updated 2 months ago
lucidrains / pause-transformer
Yet another random morning idea to be quickly tried and architecture shared if it works; to allow the transformer to pause for any amount…
☆53 Updated 2 years ago
gregorbachmann / Next-Token-Failures
☆106 Updated last year
mlfoundations / scaling
Language models scale reliably with over-training and on downstream tasks
☆100 Updated last year
JeanKaddour / NoTrainNoGain
Revisiting Efficient Training Algorithms For Transformer-based Language Models (NeurIPS 2023)
☆81 Updated 2 years ago
amirzandieh / HyperAttention
Triton Implementation of HyperAttention Algorithm
☆48 Updated last year
RobertCsordas / moe
Official repository for the paper "Approximating Two-Layer Feedforward Networks for Efficient Transformers"
☆38 Updated 5 months ago
janphilippfranken / sami
Self-Supervised Alignment with Mutual Information
☆21 Updated last year
HazyResearch / prefix-linear-attention
☆57 Updated last year
sustcsonglin / mamba-triton
☆50 Updated last year
NohTow / PPL-MCTS
Repository for the code of the "PPL-MCTS: Constrained Textual Generation Through Discriminator-Guided Decoding" paper, NAACL'22
☆66 Updated 3 years ago
PiotrNawrot / dynamic-pooling
Efficient Transformers with Dynamic Token Pooling
☆65 Updated 2 years ago
ethancaballero / broken_neural_scaling_laws
Code Release for "Broken Neural Scaling Laws" (BNSL) paper
☆59 Updated 2 years ago
allenai / easy-to-hard-generalization
Code for the arXiv preprint "The Unreasonable Effectiveness of Easy Training Data"
☆48 Updated last year
sjelassi / transformers_ssm_copy
☆35 Updated last year
justinlovelace / Diffusion-Guided-LM
☆29 Updated last month
TRI-ML / linear_open_lm
A repository for research on medium sized language models.
☆78 Updated last year
EleutherAI / mdl
Minimum Description Length probing for neural network representations
☆20 Updated 10 months ago
chijames / KERPLE
☆20 Updated 3 years ago
hughbzhang / o1_inference_scaling_laws
Replicating O1 inference-time scaling laws
☆90 Updated last year
princeton-nlp / TransformerPrograms
[NeurIPS 2023] Learning Transformer Programs
☆162 Updated last year
yikangshen / megablocks
☆20 Updated last year
RobertCsordas / moeut
☆89 Updated last year
shreyansh26 / Attention-Mask-Patterns
Using FlexAttention to compute attention with different masking patterns
☆47 Updated last year
taufeeque9 / codebook-features
Sparse and discrete interpretability tool for neural networks
☆64 Updated last year
sunyt32 / torchscale
Transformers at any scale
☆42 Updated last year