RobertCsordas / moe_layer

sigma-MoE layer

☆20

Alternatives and similar repositories for moe_layer

Users who are interested in moe_layer compare it to the libraries listed below.

RobertCsordas / moe
Official repository for the paper "Approximating Two-Layer Feedforward Networks for Efficient Transformers"
☆38 Updated 5 months ago
sustcsonglin / gated_linear_attention_layer
☆32 Updated last year
glassroom / heinsen_attention
Reference implementation of "Softmax Attention with Constant Cost per Token" (Heinsen, 2024)
☆24 Updated last year
sustcsonglin / mamba-triton
☆50 Updated last year
jenni-ai / T2FW
Fine-Tuning Pre-trained Transformers into Decaying Fast Weights
☆19 Updated 3 years ago
BlinkDL / LinearAttentionArena
Here we will test various linear attention designs.
☆62 Updated last year
renll / SeqBoat
[NeurIPS 2023] Sparse Modular Activation for Efficient Sequence Modeling
☆40 Updated 2 years ago
HazyResearch / prefix-linear-attention
☆57 Updated last year
yikangshen / megablocks
☆20 Updated last year
OpenNLPLab / HGRN
[NeurIPS 2023 spotlight] Official implementation of HGRN in our NeurIPS 2023 paper - Hierarchically Gated Recurrent Neural Network for Se…
☆66 Updated last year
dangxingyu / rnn-icrag
Official repository of paper "RNNs Are Not Transformers (Yet): The Key Bottleneck on In-context Retrieval"
☆27 Updated last year
nikhilvyas / SOAP_MUON
Combining SOAP and MUON
☆17 Updated 9 months ago
kyegomez / Blockwise-Parallel-Transformer
32 times longer context window than vanilla Transformers and up to 4 times longer than memory efficient Transformers.
☆49 Updated 2 years ago
codekansas / rwkv
RWKV model implementation
☆38 Updated 2 years ago
acosharma / elita-transformer
Official Repository for Efficient Linear-Time Attention Transformers.
☆18 Updated last year
berlino / gated_linear_attention
☆106 Updated last year
amirzandieh / HyperAttention
Triton Implementation of HyperAttention Algorithm
☆48 Updated last year
JeanKaddour / NoTrainNoGain
Revisiting Efficient Training Algorithms For Transformer-based Language Models (NeurIPS 2023)
☆81 Updated 2 years ago
EleutherAI / rnngineering
Engineering the state of RNN language models (Mamba, RWKV, etc.)
☆32 Updated last year
shreyansh26 / Attention-Mask-Patterns
Using FlexAttention to compute attention with different masking patterns
☆47 Updated last year
lsj2408 / URPE
[NeurIPS 2022] Your Transformer May Not be as Powerful as You Expect (official implementation)
☆34 Updated 2 years ago
lucidrains / pause-transformer
Yet another random morning idea to be quickly tried and architecture shared if it works; to allow the transformer to pause for any amount…
☆53 Updated 2 years ago
proger / hippogriff
Griffin MQA + Hawk Linear RNN Hybrid
☆89 Updated last year
lucidrains / memory-editable-transformer
My explorations into editing the knowledge and memories of an attention network
☆35 Updated 3 years ago
McGill-NLP / length-generalization
Code for the paper "The Impact of Positional Encoding on Length Generalization in Transformers", NeurIPS 2023
☆138 Updated last year
JonasGeiping / linear_cross_entropy_loss
A fusion of a linear layer and a cross entropy loss, written for pytorch in triton.
☆73 Updated last year
berlino / seq_icl
☆53 Updated last year
chijames / KERPLE
☆20 Updated 3 years ago
microsoft / SparseMixer
Sparse Backpropagation for Mixture-of-Expert Training
☆29 Updated last year
lucidrains / token-shift-gpt
Implementation of Token Shift GPT - An autoregressive model that solely relies on shifting the sequence space for mixing
☆50 Updated 3 years ago