BryceZhuo / HybridNorm
The official implementation of HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization
☆17 · Updated 9 months ago
Alternatives and similar repositories for HybridNorm
Users interested in HybridNorm are comparing it to the libraries listed below.
- ☆16 · Updated last year
- [NeurIPS 2024] Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies (https://arxiv.org/abs/2407.13623) ☆89 · Updated last year
- [NAACL 2025] A Closer Look into Mixture-of-Experts in Large Language Models ☆56 · Updated 10 months ago
- Code for paper "Patch-Level Training for Large Language Models" ☆96 · Updated last month
- FocusLLM: Scaling LLM's Context by Parallel Decoding ☆44 · Updated last year
- The official repository for SkyLadder: Better and Faster Pretraining via Context Window Scheduling ☆40 · Updated 2 months ago
- Code for paper "Diffusion Language Models Can Perform Many Tasks with Scaling and Instruction-Finetuning" ☆84 · Updated last year
- Official implementation of Bootstrapping Language Models via DPO Implicit Rewards ☆46 · Updated 8 months ago
- [ICLR 2024] CLEX: Continuous Length Extrapolation for Large Language Models ☆78 · Updated last year
- Large Language Models Can Self-Improve in Long-context Reasoning ☆73 · Updated last year
- [ACL 2024] Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning ☆52 · Updated last year
- [ICLR 2023] "Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers" by Tianlong Chen*, Zhenyu Zhang*, Ajay Jaiswal… ☆56 · Updated 2 years ago
- This repo contains code and data for the ICLR 2025 paper MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs ☆36 · Updated 9 months ago
- The official implementation of "DAPE: Data-Adaptive Positional Encoding for Length Extrapolation" ☆39 · Updated last year
- The code and data for the paper JiuZhang3.0 ☆49 · Updated last year
- Official repository of paper "RNNs Are Not Transformers (Yet): The Key Bottleneck on In-context Retrieval" ☆27 · Updated last year
- [ACL 2025] Are Your LLMs Capable of Stable Reasoning? ☆32 · Updated 4 months ago
- [EMNLP 2023] Context Compression for Auto-regressive Transformers with Sentinel Tokens ☆25 · Updated 2 years ago
- [ICML 2025] Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment (https://arxiv.org/abs/2410.02197) ☆37 · Updated 3 months ago
- Code for ICLR 2025 Paper "What is Wrong with Perplexity for Long-context Language Modeling?" ☆107 · Updated 2 months ago
- Mixture of Attention Heads ☆51 · Updated 3 years ago
- [ICML'24 Oral] The official code of "DiJiang: Efficient Large Language Models through Compact Kernelization", a novel DCT-based linear at… ☆104 · Updated last year
- ☆114 · Updated 3 months ago
- ☆29 · Updated last year
- Codebase for Instruction Following without Instruction Tuning ☆36 · Updated last year
- Directional Preference Alignment ☆58 · Updated last year
- ☆19 · Updated 11 months ago
- ☆20 · Updated 9 months ago
- The source code of "Merging Experts into One: Improving Computational Efficiency of Mixture of Experts" (EMNLP 2023) ☆44 · Updated last year
- ☆14 · Updated 2 years ago