hazdzz / tiger
A Tight-fisted Optimizer (Tiger), implemented in PyTorch.
☆11 · Updated 9 months ago
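For context, Tiger uses a Lion-style sign-of-momentum update that keeps a single moving average per parameter instead of Adam's two. Below is a minimal PyTorch sketch of that update rule; the class name `TigerSketch` and the default hyperparameters are illustrative assumptions, not this repository's actual API.

```python
# Minimal, illustrative sketch of a Tiger-style optimizer (not hazdzz/tiger's API).
# Assumed update rule: m_t = beta * m_{t-1} + (1 - beta) * g_t
#                      theta_t = theta_{t-1} - lr * (sign(m_t) + weight_decay * theta_{t-1})
import torch
from torch.optim.optimizer import Optimizer


class TigerSketch(Optimizer):
    def __init__(self, params, lr=1e-3, beta=0.965, weight_decay=0.0):
        defaults = dict(lr=lr, beta=beta, weight_decay=weight_decay)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            lr, beta, wd = group["lr"], group["beta"], group["weight_decay"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if "m" not in state:
                    state["m"] = torch.zeros_like(p)
                m = state["m"]
                # Single exponential moving average of the gradient.
                m.mul_(beta).add_(p.grad, alpha=1 - beta)
                # Sign-of-momentum step with decoupled weight decay.
                p.add_(torch.sign(m) + wd * p, alpha=-lr)
        return loss
```

Storing only one momentum buffer is what makes the optimizer "tight-fisted": its state memory matches SGD with momentum rather than Adam.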
Alternatives and similar repositories for tiger:
Users interested in tiger are comparing it to the libraries listed below.
- A Tight-fisted Optimizer ☆47 · Updated 2 years ago
- Lion and Adam optimization comparison ☆60 · Updated 2 years ago
- [ICLR 2024] EMO: Earth Mover Distance Optimization for Auto-Regressive Language Modeling (https://arxiv.org/abs/2310.04691) ☆120 · Updated last year
- Research without Re-search: Maximal Update Parametrization Yields Accurate Loss Prediction across Scales ☆32 · Updated last year
- [EVA ICLR'23; LARA ICML'22] Efficient attention mechanisms via control variates, random features, and importance sampling ☆82 · Updated 2 years ago
- Code for preprint "Metadata Conditioning Accelerates Language Model Pre-training (MeCo)" ☆36 · Updated last week
- [ICLR 2023] Tailoring Language Generation Models under Total Variation Distance ☆21 · Updated 2 years ago
- Implementation of "Decoding-time Realignment of Language Models", ICML 2024 ☆18 · Updated 9 months ago
- A collection of instruction data and scripts for machine translation. ☆20 · Updated last year
- [EMNLP 2022] Official implementation of Transnormer in our EMNLP 2022 paper - The Devil in Linear Transformer ☆60 · Updated last year
- Code for EMNLP 2021 main conference paper "Dynamic Knowledge Distillation for Pre-trained Language Models" ☆40 · Updated 2 years ago
- This repository is the official implementation of our EMNLP 2022 paper ELMER: A Non-Autoregressive Pre-trained Language Model for Efficie… ☆26 · Updated 2 years ago
- [ICLR 2023] "Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers" by Tianlong Chen*, Zhenyu Zhang*, Ajay Jaiswal… ☆48 · Updated 2 years ago
- [NAACL 2025] A Closer Look into Mixture-of-Experts in Large Language Models ☆46 · Updated last month
- Mixture of Attention Heads ☆43 · Updated 2 years ago
- Official code for ICLR 2022 paper: "PoNet: Pooling Network for Efficient Token Mixing in Long Sequences". ☆31 · Updated last year
- An Experiment on Dynamic NTK Scaling RoPE ☆62 · Updated last year
- This is a personal reimplementation of Google's Infini-transformer, utilizing a small 2b model. The project includes both model and train… ☆56 · Updated 11 months ago
- [ICLR 2024] CLEX: Continuous Length Extrapolation for Large Language Models ☆76 · Updated last year
- Code for paper "Patch-Level Training for Large Language Models" ☆81 · Updated 4 months ago
- [NeurIPS 2022] "A Win-win Deal: Towards Sparse and Robust Pre-trained Language Models", Yuanxin Liu, Fandong Meng, Zheng Lin, Jiangnan Li… ☆21 · Updated last year
- Are Intermediate Layers and Labels Really Necessary? A General Language Model Distillation Method; GKD: A General Knowledge Distillation… ☆32 · Updated last year
- [EMNLP 2022] Differentiable Data Augmentation for Contrastive Sentence Representation Learning. https://arxiv.org/abs/2210.16536 ☆39 · Updated 2 years ago
- Transformer model based on the Gated Attention Unit (preview version) ☆97 · Updated 2 years ago
- Official Code for 'EPiDA: An Easy Plug-in Data Augmentation Framework for High Performance Text Classification' - NAACL 2022 ☆23 · Updated 2 years ago