nanowell / Differential-Transformer-PyTorch

PyTorch implementation of the Differential-Transformer architecture for sequence modeling, tailored as a decoder-only model in the style of large language models (LLMs). The architecture combines a Differential Attention mechanism with a multi-head structure, RMSNorm, and SwiGLU.
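The core idea behind Differential Attention is to compute two separate softmax attention maps and subtract one from the other, scaled by a learnable factor, so that attention noise common to both maps cancels out. The sketch below is a minimal single-head illustration of that idea, not the repository's actual implementation; the class name `DiffAttention`, the parameter names, and the fixed scalar `lambda_init` are all assumptions for illustration.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class DiffAttention(nn.Module):
    """Minimal single-head differential attention: the difference of two softmax maps.

    Hypothetical sketch; the real repository likely uses a multi-head variant
    with additional normalization.
    """

    def __init__(self, d_model: int, d_head: int, lambda_init: float = 0.8):
        super().__init__()
        # Project the input into two query/key groups (Q1, Q2 and K1, K2) and one value group.
        self.q_proj = nn.Linear(d_model, 2 * d_head, bias=False)
        self.k_proj = nn.Linear(d_model, 2 * d_head, bias=False)
        self.v_proj = nn.Linear(d_model, d_head, bias=False)
        self.out_proj = nn.Linear(d_head, d_model, bias=False)
        # Learnable scalar controlling how strongly the second map is subtracted (assumed init).
        self.lmbda = nn.Parameter(torch.tensor(lambda_init))
        self.scale = 1.0 / math.sqrt(d_head)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q1, q2 = self.q_proj(x).chunk(2, dim=-1)
        k1, k2 = self.k_proj(x).chunk(2, dim=-1)
        v = self.v_proj(x)

        # Causal mask for decoder-only (autoregressive) modeling.
        seq_len = x.size(1)
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )

        # Two independent scaled dot-product score matrices.
        a1 = (q1 @ k1.transpose(-2, -1)) * self.scale
        a2 = (q2 @ k2.transpose(-2, -1)) * self.scale
        a1 = a1.masked_fill(mask, float("-inf"))
        a2 = a2.masked_fill(mask, float("-inf"))

        # Differential attention: subtract the second softmax map from the first.
        attn = F.softmax(a1, dim=-1) - self.lmbda * F.softmax(a2, dim=-1)
        return self.out_proj(attn @ v)


if __name__ == "__main__":
    x = torch.randn(2, 16, 64)              # (batch, seq_len, d_model)
    layer = DiffAttention(d_model=64, d_head=32)
    print(layer(x).shape)                   # torch.Size([2, 16, 64])
```

In a full decoder block this attention layer would typically be wrapped with RMSNorm and followed by a SwiGLU feed-forward layer, as listed in the description above.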

Related projects

Alternatives and complementary repositories for Differential-Transformer-PyTorch