apple / ml-sigmoid-attentionLinks

☆293

Alternatives and similar repositories for ml-sigmoid-attention

Users that are interested in ml-sigmoid-attention are comparing it to the libraries listed below

Sorting:

NVlabs / GatedDeltaNet
[ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule
☆192Updated 4 months ago
tensorgi / TPA
The official implementation of TPA: Tensor ProducT ATTenTion Transformer (T6) (https://arxiv.org/abs/2501.06425)
☆380Updated last week
NVIDIA / ngpt
Normalized Transformer (nGPT)
☆185Updated 8 months ago
jxiw / MambaInLlama
[NeurIPS 2024] Official Repository of The Mamba in the Llama: Distilling and Accelerating Hybrid Models
☆225Updated 3 months ago
kyleliang919 / C-Optim
When it comes to optimizers, it's always better to be safe than sorry
☆344Updated this week
lucidrains / nGPT-pytorch
Quick implementation of nGPT, learning entirely on the hypersphere, from NvidiaAI
☆288Updated 2 months ago
NVlabs / hymba
☆190Updated 7 months ago
Haiyang-W / TokenFormer
[ICLR2025 Spotlight🔥] Official Implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
☆567Updated 5 months ago
goombalab / hnet
H-Net: Hierarchical Network with Dynamic Chunking
☆593Updated 3 weeks ago
HanGuo97 / log-linear-attention
☆228Updated last month
test-time-training / ttt-lm-jax
Official JAX implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States
☆418Updated 11 months ago
bluorion-com / ZClip
Official implementation of the paper: "ZClip: Adaptive Spike Mitigation for LLM Pre-Training".
☆131Updated last month
goombalab / phi-mamba
Official implementation of Phi-Mamba. A MOHAWK-distilled model (Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Mode…
☆113Updated 10 months ago
lucidrains / PEER-pytorch
Pytorch implementation of the PEER block from the paper, Mixture of A Million Experts, by Xu Owen He at Deepmind
☆127Updated 11 months ago
jzhang38 / LongMamba
Some preliminary explorations of Mamba's context scaling.
☆216Updated last year
nanowell / AdEMAMix-Optimizer-Pytorch
The AdEMAMix Optimizer: Better, Faster, Older.
☆184Updated 10 months ago
apple / ml-cross-entropy
☆506Updated last week
fla-org / flame
🔥 A minimal training framework for scaling FLA models
☆209Updated last month
lucidrains / infini-transformer-pytorch
Implementation of Infini-Transformer in Pytorch
☆111Updated 7 months ago
llm-random / llm-random
☆192Updated last week
raymin0223 / mixture_of_recursions
Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
☆314Updated last week
lucidrains / soft-moe-pytorch
Implementation of Soft MoE, proposed by Brain's Vision team, in Pytorch
☆309Updated 4 months ago
HazyResearch / based
Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff"
☆238Updated last month
zyushun / Adam-mini
Code for Adam-mini: Use Fewer Learning Rates To Gain More https://arxiv.org/abs/2406.16793
☆431Updated 2 months ago
goombalab / hydra
Official implementation of "Hydra: Bidirectional State Space Models Through Generalized Matrix Mixers"
☆150Updated 6 months ago
sustcsonglin / linear-attention-and-beyond-slides
☆79Updated 5 months ago
HazyResearch / lolcats
Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models"
☆244Updated 6 months ago
lucidrains / ring-attention-pytorch
Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch
☆532Updated 2 months ago
HazyResearch / zoology
Understand and test language model architectures on synthetic tasks.
☆221Updated 3 weeks ago
facebookresearch / spdl
Scalable and Performant Data Loading
☆290Updated last week