srush / do-we-need-attentionLinks

☆166

Alternatives and similar repositories for do-we-need-attention

Users that are interested in do-we-need-attention are comparing it to the libraries listed below

Sorting:

HazyResearch / zoology
Understand and test language model architectures on synthetic tasks.
☆233Updated last month
McGill-NLP / length-generalization
Code for the paper "The Impact of Positional Encoding on Length Generalization in Transformers", NeurIPS 2023
☆136Updated last year
athms / mad-lab
A MAD laboratory to improve AI architecture designs 🧪
☆131Updated 10 months ago
microsoft / mutransformers
some common Huggingface transformers in maximal update parametrization (µP)
☆86Updated 3 years ago
mlfoundations / scaling
Language models scale reliably with over-training and on downstream tasks
☆100Updated last year
EleutherAI / nanoGPT-mup
The simplest, fastest repository for training/finetuning medium-sized GPTs.
☆166Updated 3 months ago
jxiw / BiGS
Official Repository of Pretraining Without Attention (BiGS), BiGS is the first model to achieve BERT-level transfer learning on the GLUE …
☆114Updated last year
epfml / llm-baselines
nanoGPT-like codebase for LLM training
☆107Updated 5 months ago
vvvm23 / mamba-jax
Unofficial but Efficient Implementation of "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" in JAX
☆88Updated last year
princeton-nlp / TransformerPrograms
[NeurIPS 2023] Learning Transformer Programs
☆162Updated last year
berlino / seq_icl
☆53Updated last year
srush / mamba-primer
☆38Updated last year
lucidrains / pause-transformer
Yet another random morning idea to be quickly tried and architecture shared if it works; to allow the transformer to pause for any amount…
☆52Updated 2 years ago
lucidrains / simple-hierarchical-transformer
Experiments around a simple idea for inducing multiple hierarchical predictive model within a GPT
☆222Updated last year
HazyResearch / based
Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff"
☆241Updated 4 months ago
srush / LLM-Talk
☆52Updated last year
xrsrke / pipegoose
Large scale 4D parallelism pre-training for 🤗 transformers in Mixture of Experts *(still work in progress)*
☆87Updated last year
cloneofsimo / min-fsdp
☆91Updated last year
insuhan / hyper-attn
☆83Updated last year
lucidrains / PaLM-jax
Implementation of the specific Transformer architecture from PaLM - Scaling Language Modeling with Pathways - in Jax (Equinox framework)
☆188Updated 3 years ago
lucidrains / CoLT5-attention
Implementation of the conditionally routed attention in the CoLT5 architecture, in Pytorch
☆230Updated last year
epfml / schedules-and-scaling
Code for NeurIPS 2024 Spotlight: "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations"
☆84Updated 11 months ago
Sea-Snell / JAX_llama
Inference code for LLaMA models in JAX
☆119Updated last year
lucidrains / taylor-series-linear-attention
Explorations into the recently proposed Taylor Series Linear Attention
☆99Updated last year
JeanKaddour / NoTrainNoGain
Revisiting Efficient Training Algorithms For Transformer-based Language Models (NeurIPS 2023)
☆80Updated 2 years ago
ayaka14732 / llama-2-jax
JAX implementation of the Llama 2 model
☆216Updated last year
abhishekpanigrahi1996 / transformer_in_transformer
☆45Updated 2 years ago
davisyoshida / lorax
LoRA for arbitrary JAX models and functions
☆141Updated last year
shikaiqiu / compute-better-spent
☆58Updated last year
lucidrains / mixture-of-attention
Some personal experiments around routing tokens to different autoregressive attention, akin to mixture-of-experts
☆119Updated last year