nanowell / Differential-Transformer-PyTorch

PyTorch implementation of the Differential-Transformer architecture for sequence modeling, specifically tailored as a decoder-only model similar to large language models (LLMs). The architecture incorporates a novel Differential Attention mechanism, Multi-Head structure, RMSNorm, and SwiGLU.

☆77

Alternatives and similar repositories for Differential-Transformer-PyTorch

Users who are interested in Differential-Transformer-PyTorch are comparing it to the repositories listed below.

tommyip / mamba2-minimal
Minimal Mamba-2 implementation in PyTorch
☆226 Updated last year
kyegomez / Griffin
Implementation of Griffin from the paper: "Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models"
☆56 Updated this week
kyegomez / MambaTransformer
Integrating Mamba/SSMs with Transformer for Enhanced Long Context and High-Quality Sequence Modeling
☆207 Updated last week
AmeenAli / HiddenMambaAttn
Official PyTorch Implementation of "The Hidden Attention of Mamba Models"
☆228 Updated last week
kyegomez / SwitchTransformers
Implementation of Switch Transformers from the paper: "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficien…
☆125 Updated 3 weeks ago
WailordHe / DenseSSM
A repository for DenseSSMs
☆89 Updated last year
kyegomez / DifferentialTransformer
An open source community implementation of the model from "DIFFERENTIAL TRANSFORMER" paper by Microsoft.
☆34 Updated this week
Adamdad / rational_kat_cu
☆75 Updated 8 months ago
pengzhangzhi / Awesome-Mamba
Awesome list of papers that extend Mamba to various applications.
☆138 Updated 4 months ago
Hprairie / Bi-Mamba2
A Triton Kernel for incorporating Bi-Directionality in Mamba2
☆75 Updated 10 months ago
Caiyun-AI / DCFormer
☆218 Updated 8 months ago
MzeroMiko / mamba-mini
An efficient pytorch implementation of selective scan in one file, works with both cpu and gpu, with corresponding mathematical derivatio…
☆95 Updated last week
badripatro / simba
Simba
☆214 Updated last year
badripatro / mamba360
State Space Models
☆70 Updated last year
kyegomez / MoE-Mamba
Implementation of MoE Mamba from the paper: "MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts" in Pytorch and Ze…
☆112 Updated this week
MambaMixer / M2
☆47 Updated last year
akaashdash / kansformers
☆137 Updated last year
vulus98 / Rethinking-attention
My implementation of the original transformer model (Vaswani et al.). I've additionally included the playground.py file for visualizing o…
☆44 Updated 10 months ago
xmindflow / Awesome_Mamba
Computation-Efficient Era: A Comprehensive Survey of State Space Models in Medical Image Analysis
☆257 Updated 3 months ago
hkproj / mamba-notes
Notes on the Mamba and the S4 model (Mamba: Linear-Time Sequence Modeling with Selective State Spaces)
☆172 Updated last year
Caiyun-AI / MUDDFormer
☆85 Updated 5 months ago
Chaos96 / fourierft
☆147 Updated last year
TsinghuaC3I / Fourier-Position-Embedding
[ICML 2025] Fourier Position Embedding: Enhancing Attention’s Periodic Extension for Length Generalization
☆100 Updated 4 months ago
LeapLabTHU / MLLA
Official repository of MLLA (NeurIPS 2024)
☆355 Updated 3 months ago
transformer-vq / transformer_vq
☆197 Updated last year
Weixin-Liang / Mixture-of-Mamba
☆50 Updated 8 months ago
kyegomez / Jamba
PyTorch Implementation of Jamba: "Jamba: A Hybrid Transformer-Mamba Language Model"
☆192 Updated 3 weeks ago
jindongli-Ai / LLM-Discrete-Tokenization-Survey
The official GitHub page for the survey paper "Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey". And this paper is unde…
☆65 Updated 2 months ago
SJTU-DeepVisionLab / FLoRA
☆41 Updated last year
goombalab / hydra
Official implementation of "Hydra: Bidirectional State Space Models Through Generalized Matrix Mixers"
☆161 Updated 8 months ago