dmis-lab / Outlier-Safe-Pre-Training

[ACL 2025] Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models

☆34

Alternatives and similar repositories for Outlier-Safe-Pre-Training

Users interested in Outlier-Safe-Pre-Training compare it to the repositories listed below.

sustcsonglin / mamba-triton
☆49 Updated last year
berlino / seq_icl
☆53 Updated last year
PiotrNawrot / sparse-frontier
The evaluation framework for training-free sparse attention in LLMs
☆103 Updated last month
shreyansh26 / Attention-Mask-Patterns
Using FlexAttention to compute attention with different masking patterns
☆47 Updated last year
epfml / schedules-and-scaling
Code for NeurIPS 2024 Spotlight: "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations"
☆85 Updated last year
siyan-zhao / prepacking
The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS …
☆60 Updated last year
insuhan / hyper-attn
☆83 Updated last year
nil0x9 / flash-muon
Flash-Muon: An Efficient Implementation of Muon Optimizer
☆206 Updated 5 months ago
PiotrNawrot / nano-sparse-attention
The simplest implementation of recent Sparse Attention patterns for efficient LLM inference.
☆91 Updated 4 months ago
Zyphra / tree_attention
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters
☆130 Updated 11 months ago
Edward-Sun / gpt-accelera
Simple and efficient pytorch-native transformer training and inference (batched)
☆78 Updated last year
amirzandieh / HyperAttention
Triton Implementation of HyperAttention Algorithm
☆48 Updated last year
mengxiayu / LLMSuperWeight
Code for studying the super weight in LLM
☆120 Updated 11 months ago
mgmalek / efficient_cross_entropy
☆121 Updated last year
test-time-training / ttt-tk
☆41 Updated 3 weeks ago
HazyResearch / prefix-linear-attention
☆57 Updated last year
martin-marek / batch-size
📄Small Batch Size Training for Language Models
☆63 Updated last month
shawntan / stickbreaking-attention
Stick-breaking attention
☆61 Updated 4 months ago
BlinkDL / LinearAttentionArena
Here we will test various linear attention designs.
☆61 Updated last year
RobertCsordas / moeut
☆88 Updated last year
frankxwang / dpo-prefix-sharing
DPO, but faster 🚀
☆46 Updated 11 months ago
athms / mad-lab
A MAD laboratory to improve AI architecture designs 🧪
☆133 Updated 11 months ago
JonasGeiping / linear_cross_entropy_loss
A fusion of a linear layer and a cross entropy loss, written for pytorch in triton.
☆70 Updated last year
NVIDIA-NeMo / Emerging-Optimizers
☆66 Updated this week
samsja / muon_fsdp_2
Muon fsdp 2
☆45 Updated 3 months ago
HazyResearch / based
Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff"
☆243 Updated 5 months ago
epfml / dynamic-sparse-flash-attention
☆150 Updated 2 years ago
google-deepmind / asyncdiloco
☆47 Updated last year
xiayuqing0622 / flex_head_fa
Fast and memory-efficient exact attention
☆74 Updated 8 months ago
mlfoundations / scaling
Language models scale reliably with over-training and on downstream tasks
☆100 Updated last year