nanowell / Differential-Transformer-PyTorch

PyTorch implementation of the Differential-Transformer architecture for sequence modeling, built as a decoder-only model in the style of large language models (LLMs). The architecture combines a novel Differential Attention mechanism with a multi-head structure, RMSNorm, and SwiGLU.
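As a rough illustration of the core idea, the sketch below shows a minimal single-head version of differential attention: two softmax attention maps are computed from split query/key projections and the second map is subtracted, scaled by a learnable lambda, which cancels common-mode attention noise. The class name, the plain lambda parameterization, and the default `lambda_init` are illustrative assumptions here, not the repository's actual API.

```python
import math
import torch
import torch.nn as nn


class DiffAttentionHead(nn.Module):
    """Minimal single-head differential attention sketch (illustrative, not the repo's API)."""

    def __init__(self, d_model: int, d_head: int, lambda_init: float = 0.8):
        super().__init__()
        # Two sets of query/key projections; a single value projection.
        self.q_proj = nn.Linear(d_model, 2 * d_head, bias=False)
        self.k_proj = nn.Linear(d_model, 2 * d_head, bias=False)
        self.v_proj = nn.Linear(d_model, d_head, bias=False)
        # Learnable scalar lambda; the paper re-parameterizes it, a plain
        # parameter initialized to lambda_init is used here for brevity.
        self.lam = nn.Parameter(torch.tensor(lambda_init))
        self.d_head = d_head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q1, q2 = self.q_proj(x).chunk(2, dim=-1)
        k1, k2 = self.k_proj(x).chunk(2, dim=-1)
        v = self.v_proj(x)
        scale = 1.0 / math.sqrt(self.d_head)
        # Causal mask for decoder-only (autoregressive) attention.
        seq_len = x.size(1)
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        a1 = (q1 @ k1.transpose(-2, -1)) * scale
        a2 = (q2 @ k2.transpose(-2, -1)) * scale
        a1 = a1.masked_fill(mask, float("-inf")).softmax(dim=-1)
        a2 = a2.masked_fill(mask, float("-inf")).softmax(dim=-1)
        # Differential attention: subtract the second map, scaled by lambda.
        return (a1 - self.lam * a2) @ v


if __name__ == "__main__":
    head = DiffAttentionHead(d_model=64, d_head=16)
    out = head(torch.randn(2, 10, 64))
    print(out.shape)  # torch.Size([2, 10, 16])
```

In the full architecture this head would be replicated across multiple heads, wrapped with RMSNorm, and combined with a SwiGLU feed-forward block inside each decoder layer.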

Alternatives and similar repositories for Differential-Transformer-PyTorch:
