shawntan / stickbreaking-attentionLinks

Stick-breaking attention

☆58

Alternatives and similar repositories for stickbreaking-attention

Users that are interested in stickbreaking-attention are comparing it to the libraries listed below

Sorting:

epfml / schedules-and-scaling
Code for NeurIPS 2024 Spotlight: "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations"
☆77Updated 9 months ago
sustcsonglin / mamba-triton
☆49Updated last year
sjelassi / transformers_ssm_copy
☆33Updated last year
berlino / seq_icl
☆53Updated last year
qiuzh20 / gated_attention
The official implementation for Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
☆46Updated 2 months ago
sustcsonglin / linear-attention-and-beyond-slides
☆79Updated 5 months ago
zhixuan-lin / forgetting-transformer
[ICLR 2025] Official PyTorch implementation of "Forgetting Transformer: Softmax Attention with a Forget Gate"
☆118Updated last month
goombalab / phi-mamba
Official implementation of Phi-Mamba. A MOHAWK-distilled model (Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Mode…
☆113Updated 10 months ago
PKU-ML / LongPPL
Code for ICLR 2025 Paper "What is Wrong with Perplexity for Long-context Language Modeling?"
☆92Updated last week
Cranial-XIX / longhorn
Official PyTorch Implementation of the Longhorn Deep State Space Model
☆54Updated 8 months ago
BlinkDL / LinearAttentionArena
Here we will test various linear attention designs.
☆62Updated last year
HazyResearch / prefix-linear-attention
☆56Updated last year
dangxingyu / rnn-icrag
Official repository of paper "RNNs Are Not Transformers (Yet): The Key Bottleneck on In-context Retrieval"
☆27Updated last year
test-time-training / ttt-tk
☆39Updated 3 months ago
xiayuqing0622 / flex_head_fa
Fast and memory-efficient exact attention
☆69Updated 5 months ago
RobertCsordas / moeut
☆83Updated 11 months ago
HKUNLP / diffusion-vs-ar
[ICLR 2025] Code for the paper "Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning"
☆68Updated 5 months ago
Infini-AI-Lab / Kinetics
Kinetics: Rethinking Test-Time Scaling Laws
☆70Updated 3 weeks ago
jzhang38 / LongMamba
Some preliminary explorations of Mamba's context scaling.
☆216Updated last year
PiotrNawrot / nano-sparse-attention
The simplest implementation of recent Sparse Attention patterns for efficient LLM inference.
☆82Updated 2 weeks ago
microsoft / SparseMixer
Sparse Backpropagation for Mixture-of-Expert Training
☆30Updated last year
Doraemonzzz / xmixers
Xmixers: A collection of SOTA efficient token/channel mixers
☆11Updated 3 weeks ago
OpenSparseLLMs / MoM
☆95Updated 3 months ago
Edward-Sun / gpt-accelera
Simple and efficient pytorch-native transformer training and inference (batched)
☆78Updated last year
jopetty / word-problem
Experiments on the impact of depth in transformers and SSMs.
☆32Updated 8 months ago
berlino / gated_linear_attention
☆106Updated last year
ScalingIntelligence / large_language_monkeys
☆101Updated 10 months ago
facebookresearch / PhysicsLM4
Physics of Language Models, Part 4
☆204Updated this week
srush / mamba-primer
☆37Updated last year
gregorbachmann / Next-Token-Failures
☆88Updated last year