dmis-lab / Outlier-Safe-Pre-Training
[ACL 2025] Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models
☆29 · Updated last week
Alternatives and similar repositories for Outlier-Safe-Pre-Training
Users interested in Outlier-Safe-Pre-Training are comparing it to the repositories listed below.
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆127 · Updated 8 months ago
- The evaluation framework for training-free sparse attention in LLMs ☆88 · Updated last month
- ☆83 · Updated last year
- ☆49 · Updated last year
- Experiment of using Tangent to autodiff triton ☆80 · Updated last year
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆152 · Updated last month
- ☆45 · Updated last year
- 📄 Small Batch Size Training for Language Models ☆41 · Updated this week
- Here we will test various linear attention designs. ☆62 · Updated last year
- ☆114 · Updated last year
- ☆53 · Updated last year
- Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff" ☆238 · Updated 2 months ago
- The simplest, fastest repository for training/finetuning medium-sized GPTs. ☆149 · Updated last month
- Triton Implementation of HyperAttention Algorithm ☆48 · Updated last year
- Official repository for the paper "Approximating Two-Layer Feedforward Networks for Efficient Transformers" ☆38 · Updated last month
- ☆53 · Updated last year
- ☆83 · Updated 11 months ago
- Minimal (400 LOC) implementation of Maximum (multi-node, FSDP) GPT training ☆130 · Updated last year
- PyTorch implementation of the PEER block from the paper "Mixture of A Million Experts" by Xu Owen He at DeepMind ☆127 · Updated 11 months ago
- Using FlexAttention to compute attention with different masking patterns ☆44 · Updated 10 months ago
- ☆53 · Updated 10 months ago
- DPO, but faster 🚀 ☆44 · Updated 8 months ago
- A MAD laboratory to improve AI architecture designs 🧪 ☆123 · Updated 7 months ago
- A fusion of a linear layer and a cross entropy loss, written for pytorch in triton. ☆70 · Updated last year
- ☆39 · Updated 4 months ago
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS … ☆61 · Updated 9 months ago
- Supporting PyTorch FSDP for optimizers ☆84 · Updated 8 months ago
- Mixture of A Million Experts ☆46 · Updated last year
- CUDA and Triton implementations of Flash Attention with SoftmaxN. ☆71 · Updated last year
- Understand and test language model architectures on synthetic tasks. ☆221 · Updated 3 weeks ago