BryceZhuo / HybridNorm
The official implementation of HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization
☆17 Updated 8 months ago
Alternatives and similar repositories for HybridNorm
Users that are interested in HybridNorm are comparing it to the libraries listed below
- [NeurIPS-2024] Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies https://arxiv.org/abs/2407.13623 ☆89 Updated last year
- Code for paper "Diffusion Language Models Can Perform Many Tasks with Scaling and Instruction-Finetuning" ☆83 Updated last year
- Code for paper "Patch-Level Training for Large Language Models" ☆92 Updated last year
- [EMNLP 2022] Official implementation of Transnormer in our EMNLP 2022 paper - The Devil in Linear Transformer ☆63 Updated 2 years ago
- FocusLLM: Scaling LLM's Context by Parallel Decoding ☆43 Updated 11 months ago
- [EMNLP 2023] Context Compression for Auto-regressive Transformers with Sentinel Tokens ☆25 Updated 2 years ago
- [ICLR 2023] "Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers" by Tianlong Chen*, Zhenyu Zhang*, Ajay Jaiswal… ☆55 Updated 2 years ago
- This is the official implementation of "DAPE: Data-Adaptive Positional Encoding for Length Extrapolation" ☆39 Updated last year
- [EVA ICLR'23; LARA ICML'22] Efficient attention mechanisms via control variates, random features, and importance sampling ☆87 Updated 2 years ago
- [NeurIPS 2023] Make Your Pre-trained Model Reversible: From Parameter to Memory Efficient Fine-Tuning ☆31 Updated 2 years ago
- ☆16 Updated last year
- [NAACL 2025] A Closer Look into Mixture-of-Experts in Large Language Models ☆55 Updated 9 months ago
- The official implementation for [NeurIPS2025 Oral] Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink… ☆101 Updated last month
- PyTorch implementation of StableMask (ICML'24) ☆14 Updated last year
- Mixture of Attention Heads ☆51 Updated 3 years ago
- Code for ICLR 2025 Paper "What is Wrong with Perplexity for Long-context Language Modeling?" ☆104 Updated last month
- [ICLR 2025] Official Pytorch Implementation of "Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN" by Pengxia… ☆27 Updated 3 months ago
- ☆50 Updated 2 years ago
- The official repository for SkyLadder: Better and Faster Pretraining via Context Window Scheduling ☆40 Updated last month
- ☆56 Updated last year
- ☆11 Updated last year
- Official Repo for the paper: VCR: Visual Caption Restoration. Check arxiv.org/pdf/2406.06462 for details. ☆31 Updated 8 months ago
- ☆27 Updated last month
- Large Language Models Can Self-Improve in Long-context Reasoning ☆73 Updated 11 months ago
- [ICLR 2024] CLEX: Continuous Length Extrapolation for Large Language Models ☆78 Updated last year
- The code and data for the paper JiuZhang3.0 ☆49 Updated last year
- ☆106 Updated last month
- Official repository of paper "RNNs Are Not Transformers (Yet): The Key Bottleneck on In-context Retrieval" ☆27 Updated last year
- Official implementation of Bootstrapping Language Models via DPO Implicit Rewards ☆44 Updated 7 months ago
- ☆61 Updated 3 weeks ago