RobertCsordas / moeLinks

Official repository for the paper "Approximating Two-Layer Feedforward Networks for Efficient Transformers"

☆38

Alternatives and similar repositories for moe

Users that are interested in moe are comparing it to the libraries listed below

Sorting:

amirzandieh / HyperAttention
Triton Implementation of HyperAttention Algorithm
☆48Updated last year
TRI-ML / linear_open_lm
A repository for research on medium sized language models.
☆78Updated last year
RobertCsordas / moe_attention
Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention"
☆102Updated last year
RobertCsordas / moeut
☆89Updated last year
RobertCsordas / moe_layer
sigma-MoE layer
☆20Updated last year
shreyansh26 / Attention-Mask-Patterns
Using FlexAttention to compute attention with different masking patterns
☆47Updated last year
epfml / DenseFormer
☆82Updated last year
recursal / GoldFinch-paper
GoldFinch and other hybrid transformer components
☆45Updated last year
abhishekpanigrahi1996 / transformer_in_transformer
☆45Updated 2 years ago
lucidrains / pause-transformer
Yet another random morning idea to be quickly tried and architecture shared if it works; to allow the transformer to pause for any amount…
☆53Updated 2 years ago
dangxingyu / rnn-icrag
Official repository of paper "RNNs Are Not Transformers (Yet): The Key Bottleneck on In-context Retrieval"
☆27Updated last year
frankxwang / dpo-prefix-sharing
DPO, but faster 🚀
☆46Updated 11 months ago
kaiokendev / cutoff-len-is-context-len
Demonstration that finetuning RoPE model on larger sequences than the pre-trained model adapts the model context limit
☆63Updated 2 years ago
McGill-NLP / length-generalization
Code for the paper "The Impact of Positional Encoding on Length Generalization in Transformers", NeurIPS 2023
☆138Updated last year
sustcsonglin / mamba-triton
☆50Updated last year
siyan-zhao / prepacking
The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS …
☆60Updated last year
google-deepmind / asyncdiloco
☆47Updated last year
dvruette / barrel-rec-pytorch
☆53Updated last year
HazyResearch / prefix-linear-attention
☆57Updated last year
EleutherAI / rnngineering
Engineering the state of RNN language models (Mamba, RWKV, etc.)
☆32Updated last year
kyegomez / Infini-attention
Implementation of the paper: "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention" from Google in pyTO…
☆57Updated last week
r-three / RAD
Reference implementation for Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model
☆45Updated 2 months ago
berlino / seq_icl
☆53Updated last year
lucidrains / PEER-pytorch
Pytorch implementation of the PEER block from the paper, Mixture of A Million Experts, by Xu Owen He at Deepmind
☆131Updated last month
Zyphra / tree_attention
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters
☆130Updated last year
proger / hippogriff
Griffin MQA + Hawk Linear RNN Hybrid
☆89Updated last year
eth-easl / fmengine
Utilities for Training Very Large Models
☆58Updated last year
OpenNLPLab / HGRN2
HGRN2: Gated Linear RNNs with State Expansion
☆55Updated last year
kyegomez / Blockwise-Parallel-Transformer
32 times longer context window than vanilla Transformers and up to 4 times longer than memory efficient Transformers.
☆49Updated 2 years ago
codekansas / rwkv
RWKV model implementation
☆38Updated 2 years ago