bzhangGo / rmsnorm
Root Mean Square Layer Normalization
☆237 · Updated 2 years ago
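For context, RMSNorm replaces LayerNorm's mean-and-variance normalization with a single root-mean-square statistic plus a learned per-feature gain, dropping the re-centering step entirely. The snippet below is a minimal illustrative PyTorch sketch of that formula, not the repository's code; the module name, the `eps` value, and the absence of a bias term are assumptions.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Minimal sketch of RMSNorm: scale inputs by 1 / rms(x) and a learned gain.

    Unlike LayerNorm there is no mean subtraction and (in this sketch) no bias term.
    """

    def __init__(self, dim: int, eps: float = 1e-8):
        super().__init__()
        self.eps = eps                             # assumed value, for numerical stability
        self.gain = nn.Parameter(torch.ones(dim))  # learned per-feature gain g

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # rms(x) = sqrt(mean_i(x_i^2)), taken over the feature (last) dimension
        ms = x.pow(2).mean(dim=-1, keepdim=True)
        return x * torch.rsqrt(ms + self.eps) * self.gain

# Usage: drop-in wherever LayerNorm would normalize the feature dimension.
x = torch.randn(2, 16, 512)
y = RMSNorm(512)(x)  # same shape as x
```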
Alternatives and similar repositories for rmsnorm:
Users interested in rmsnorm are comparing it to the libraries listed below.
- (Unofficial) PyTorch implementation of grouped-query attention (GQA) from "GQA: Training Generalized Multi-Query Transformer Models from …" ☆162 · Updated 11 months ago
- Code for the ALiBi method for transformer language models (ICLR 2022) ☆521 · Updated last year
- Sequence modeling with Mega. ☆295 · Updated 2 years ago
- Implementation of the conditionally routed attention in the CoLT5 architecture, in Pytorch ☆226 · Updated 7 months ago
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models ☆278 · Updated last month
- Recurrent Memory Transformer ☆149 · Updated last year
- [ACL 2022] Structured Pruning Learns Compact and Accurate Models (https://arxiv.org/abs/2204.00408) ☆195 · Updated last year
- Implementation of Linformer for Pytorch ☆279 · Updated last year
- Official implementation of TransNormerLLM: A Faster and Better LLM ☆243 · Updated last year
- Official PyTorch Implementation of Long-Short Transformer (NeurIPS 2021) ☆225 · Updated 3 years ago
- Implementation of a memory efficient multi-head attention as proposed in the paper "Self-attention Does Not Need O(n²) Memory" ☆376 · Updated last year
- Large Context Attention ☆703 · Updated 2 months ago
- Implementation of fused cosine similarity attention in the same style as Flash Attention ☆213 · Updated 2 years ago
- Implementation of ST-MoE, the latest incarnation of MoE after years of research at Brain, in Pytorch ☆328 · Updated 10 months ago
- Rectified Rotary Position Embeddings ☆365 · Updated 11 months ago
- Implementation of Recurrent Memory Transformer, NeurIPS 2022 paper, in Pytorch ☆407 · Updated 3 months ago
- Implementation of a Transformer, but completely in Triton ☆263 · Updated 3 years ago
- An implementation of local windowed attention for language modeling ☆440 · Updated 3 months ago
- Tutel MoE: Optimized Mixture-of-Experts Library, supports DeepSeek FP8/FP4 ☆800 · Updated this week
- Randomized Positional Encodings Boost Length Generalization of Transformers ☆80 · Updated last year
- [ICLR 2022] Official implementation of cosformer-attention in cosFormer: Rethinking Softmax in Attention ☆190 · Updated 2 years ago
- Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models" ☆278 · Updated last month
- Implementation of Rotary Embeddings, from the Roformer paper, in Pytorch (see the sketch after this list) ☆665 · Updated 4 months ago
- Get down and dirty with FlashAttention 2.0 in pytorch: plug and play, no complex CUDA kernels ☆102 · Updated last year
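As flagged in the rotary-embeddings entry above, here is a minimal sketch of what RoPE computes: consecutive feature pairs of the queries and keys are rotated by position-dependent angles before the attention dot product, so relative position enters the attention scores. The function name, tensor layout, and `base=10000.0` default are illustrative assumptions, not the listed library's API.

```python
import torch

def rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Illustrative RoPE: rotate each consecutive feature pair (2i, 2i+1) of the
    input by angle position * theta_i, with theta_i = base^(-2i / dim)."""
    seq_len, dim = x.shape[-2], x.shape[-1]
    # One frequency per feature pair, one row of angles per position.
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]  # even / odd features form the rotated pairs
    # Apply the 2-D rotation to each (x1, x2) pair, then restore the interleaved layout.
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)

# Usage: apply to queries and keys (here assumed shaped (heads, seq_len, head_dim))
# before computing attention scores.
q = torch.randn(8, 128, 64)
q_rot = rotary_embedding(q)  # same shape as q
```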