zqOuO / GWT
☆12 · Updated 3 months ago
Alternatives and similar repositories for GWT:
Users interested in GWT are comparing it to the libraries listed below.
- ☆28 · Updated last month
- Unofficial Implementation of Selective Attention Transformer ☆16 · Updated 6 months ago
- SLTrain: a sparse plus low-rank approach for parameter and memory efficient pretraining (NeurIPS 2024) ☆30 · Updated 6 months ago
- This repository contains code for the MicroAdam paper. ☆18 · Updated 4 months ago
- Work in progress. ☆58 · Updated last month
- Code for "RSQ: Learning from Important Tokens Leads to Better Quantized LLMs" ☆15 · Updated 2 months ago
- [ICLR 2023] Eva: Practical Second-order Optimization with Kronecker-vectorized Approximation ☆12 · Updated last year
- Triton Implementation of HyperAttention Algorithm ☆47 · Updated last year
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆100 · Updated this week
- An extension to the GaLore paper, to perform Natural Gradient Descent in a low-rank subspace ☆16 · Updated 6 months ago
- ☆52 · Updated 11 months ago
- This repo is based on https://github.com/jiaweizzhao/GaLore ☆27 · Updated 7 months ago
- ☆53 · Updated 7 months ago
- ☆18 · Updated last year
- Here we will test various linear attention designs. ☆60 · Updated last year
- ☆78 · Updated 8 months ago
- Supporting PyTorch FSDP for optimizers ☆80 · Updated 5 months ago
- Official implementation of "The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs" ☆26 · Updated 2 weeks ago
- Latest Weight Averaging (NeurIPS HITY 2022) ☆30 · Updated last year
- ☆25 · Updated 5 months ago
- A fusion of a linear layer and a cross-entropy loss, written for PyTorch in Triton ☆65 · Updated 9 months ago
- Activation-aware Singular Value Decomposition for Compressing Large Language Models ☆66 · Updated 6 months ago
- [ICLR 2025] Official PyTorch implementation of "Forgetting Transformer: Softmax Attention with a Forget Gate" ☆97 · Updated last month
- ☆27 · Updated 9 months ago
- Compressible Dynamics in Deep Overparameterized Low-Rank Learning & Adaptation (ICML'24 Oral) ☆14 · Updated 9 months ago
- Revisiting Efficient Training Algorithms For Transformer-based Language Models (NeurIPS 2023) ☆80 · Updated last year
- ☆14 · Updated 2 months ago
- Official PyTorch implementation of "Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models" ☆31 · Updated 2 weeks ago
- Using FlexAttention to compute attention with different masking patterns ☆43 · Updated 7 months ago
- Why Do We Need Weight Decay in Modern Deep Learning? [NeurIPS 2024] ☆66 · Updated 7 months ago