CLAIRE-Labo / StructuredFFN

The official code of "Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers"

☆19

Alternatives and similar repositories for StructuredFFN

Users who are interested in StructuredFFN are comparing it to the libraries listed below.

berlino / seq_icl
☆53 Updated last year
RobertCsordas / moe
Official repository for the paper "Approximating Two-Layer Feedforward Networks for Efficient Transformers"
☆38 Updated 4 months ago
proger / hippogriff
Griffin MQA + Hawk Linear RNN Hybrid
☆89 Updated last year
NX-AI / mlstm_kernels
Tiled Flash Linear Attention library for fast and efficient mLSTM Kernels.
☆72 Updated last week
JonasGeiping / linear_cross_entropy_loss
A fusion of a linear layer and a cross entropy loss, written for pytorch in triton.
☆69 Updated last year
sustcsonglin / mamba-triton
☆48 Updated last year
shikaiqiu / compute-better-spent
☆60 Updated last year
fal-ai-community / NativeSparseAttention
research impl of Native Sparse Attention (2502.11089)
☆62 Updated 8 months ago
proger / accelerated-scan
Accelerated First Order Parallel Associative Scan
☆189 Updated last year
huyphan168 / PEER
Mixture of A Million Experts
☆48 Updated last year
jopetty / word-problem
Experiments on the impact of depth in transformers and SSMs.
☆36 Updated last week
graphcore-research / unit-scaling
A library for unit scaling in PyTorch
☆132 Updated 3 months ago
kjslag / spacebyte
A byte-level decoder architecture that matches the performance of tokenized Transformers.
☆66 Updated last year
cat-state / tinypar
☆20 Updated 2 years ago
apple / ml-ademamix
☆68 Updated 11 months ago
LIONS-EPFL / scion
☆45 Updated last week
amirzandieh / HyperAttention
Triton Implementation of HyperAttention Algorithm
☆48 Updated last year
ethansmith2000 / TransformerExperiments
☆19 Updated 5 months ago
erogol / BlaGPT
Experimental playground for benchmarking language model (LM) architectures, layers, and tricks on smaller datasets. Designed for flexible…
☆84 Updated 2 weeks ago
cloneofsimo / min-max-gpt
Minimal (400 LOC) implementation Maximum (multi-node, FSDP) GPT training
☆132 Updated last year
lucidrains / PEER-pytorch
Pytorch implementation of the PEER block from the paper, Mixture of A Million Experts, by Xu Owen He at Deepmind
☆129 Updated last year
proger / nanokitchen
Parallel Associative Scan for Language Models
☆17 Updated last year
nikhilvyas / SOAP_MUON
Combining SOAP and MUON
☆16 Updated 8 months ago
mgmalek / efficient_cross_entropy
☆121 Updated last year
shreyansh26 / An-Empirical-Model-of-Large-Batch-Training
An approximate implementation of the OpenAI paper - An Empirical Model of Large-Batch Training for MNIST
☆11 Updated 2 years ago
McGill-NLP / length-generalization
Code for the paper "The Impact of Positional Encoding on Length Generalization in Transformers", NeurIPS 2023
☆136 Updated last year
athms / mad-lab
A MAD laboratory to improve AI architecture designs 🧪
☆132 Updated 10 months ago
lucidrains / pause-transformer
Yet another random morning idea to be quickly tried and architecture shared if it works; to allow the transformer to pause for any amount…
☆52 Updated 2 years ago
HazyResearch / based
Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff"
☆241 Updated 4 months ago
dvruette / barrel-rec-pytorch
☆53 Updated last year