nengwp / Lion-vs-AdamLinks

Lion and Adam optimization comparison

☆64

Alternatives and similar repositories for Lion-vs-Adam

Users that are interested in Lion-vs-Adam are comparing it to the libraries listed below

Sorting:

bojone / tiger
A Tight-fisted Optimizer
☆50Updated 2 years ago
ZhuiyiTechnology / GAU-alpha
基于Gated Attention Unit的Transformer模型（尝鲜版）
☆98Updated 2 years ago
OpenNLPLab / TransnormerLLM
Official implementation of TransNormerLLM: A Faster and Better LLM
☆247Updated last year
DRSY / EMO
[ICLR 2024]EMO: Earth Mover Distance Optimization for Auto-Regressive Language Modeling(https://arxiv.org/abs/2310.04691)
☆125Updated last year
YuchuanTian / RethinkTinyLM
[ICML'24] The official implementation of “Rethinking Optimization and Architecture for Tiny Language Models”
☆123Updated 8 months ago
jiahe7ay / infini-mini-transformer
This is a personal reimplementation of Google's Infini-transformer, utilizing a small 2b model. The project includes both model and train…
☆58Updated last year
thu-ml / low-bit-optimizers
Low-bit optimizers for PyTorch
☆131Updated 2 years ago
cofe-ai / Mu-scaling
Research without Re-search: Maximal Update Parametrization Yields Accurate Loss Prediction across Scales
☆32Updated 2 years ago
Outsider565 / LoRA-GA
☆210Updated 11 months ago
bojone / rerope
Rectified Rotary Position Embeddings
☆381Updated last year
YuchuanTian / DiJiang
[ICML'24 Oral] The official code of "DiJiang: Efficient Large Language Models through Compact Kernelization", a novel DCT-based linear at…
☆102Updated last year
fkodom / grouped-query-attention-pytorch
(Unofficial) PyTorch implementation of grouped-query attention (GQA) from "GQA: Training Generalized Multi-Query Transformer Models from …
☆180Updated last year
OpenNLPLab / lightning-attention
Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models
☆327Updated 7 months ago
kyegomez / AttentionIsOFFByOne
Implementation of "Attention Is Off By One" by Evan Miller
☆197Updated 2 years ago
OpenNLPLab / Transnormer
[EMNLP 2022] Official implementation of Transnormer in our EMNLP 2022 paper - The Devil in Linear Transformer
☆63Updated 2 years ago
CoinCheung / gdGPT
Train llm (bloom, llama, baichuan2-7b, chatglm3-6b) with deepspeed pipeline mode. Faster than zero/zero++/fsdp.
☆98Updated last year
princeton-nlp / CoFiPruning
[ACL 2022] Structured Pruning Learns Compact and Accurate Models https://arxiv.org/abs/2204.00408
☆197Updated 2 years ago
OpenBMB / BMCook
Model Compression for Big Models
☆165Updated 2 years ago
berlino / gated_linear_attention
☆105Updated last year
transformer-vq / transformer_vq
☆197Updated last year
DAMO-NLP-SG / CLEX
[ICLR 2024] CLEX: Continuous Length Extrapolation for Large Language Models
☆78Updated last year
zms1999 / SmartMoE
A MoE impl for PyTorch, [ATC'23] SmartMoE
☆70Updated 2 years ago
Oneflow-Inc / one-glm
A more efficient GLM implementation!
☆54Updated 2 years ago
TemporaryLoRA / Temp-LoRA
☆115Updated last year
SparkJiao / llama-pipeline-parallel
A prototype repo for hybrid training of pipeline parallel and distributed data parallel with comments on core code snippets. Feel free to…
☆57Updated 2 years ago
hazdzz / tiger
A Tight-fisted Optimizer (Tiger), implemented in PyTorch.
☆12Updated last year
THUDM / icetk
A unified tokenization tool for Images, Chinese and English.
☆151Updated 2 years ago
OpenLMLab / scaling-rope
code for Scaling Laws of RoPE-based Extrapolation
☆73Updated last year
HKUNLP / efficient-attention
[EVA ICLR'23; LARA ICML'22] Efficient attention mechanisms via control variates, random features, and importance sampling
☆86Updated 2 years ago
Spico197 / watchmen
😎 A simple and easy-to-use toolkit for GPU scheduling.
☆46Updated 4 months ago