huyz2023 / 2by4-pretrain

Efficient 2:4 sparse training algorithms and implementations

☆46

Alternatives and similar repositories for 2by4-pretrain:

Users who are interested in 2by4-pretrain are comparing it to the libraries listed below.

thu-ml / Jetfire-INT8Training
☆30 Updated 6 months ago
mit-han-lab / Block-Sparse-Attention
A sparse attention kernel supporting mix sparse patterns
☆98 Updated 3 months ago
andy-yang-1 / DoubleSparse
16-fold memory access reduction with nearly no loss
☆72 Updated 2 months ago
opengear-project / GEAR
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
☆152 Updated 6 months ago
LiuXiaoxuanPKU / GACT-ICML
☆41 Updated 2 years ago
INT-FlashAttention2024 / INT-FlashAttention
☆58 Updated last week
hahnyuan / ASVD4LLM
Activation-aware Singular Value Decomposition for Compressing Large Language Models
☆55 Updated 3 months ago
bytedance / AffineQuant
Official implementation of the ICLR 2024 paper AffineQuant
☆24 Updated 10 months ago
ScalingIntelligence / CATS
☆23 Updated 2 months ago
mit-han-lab / Quest
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
☆235 Updated 2 months ago
shadowpa0327 / Palu
Code for Palu: Compressing KV-Cache with Low-Rank Projection
☆63 Updated 2 months ago
ChenMnZ / PrefixQuant
An algorithm for static activation quantization of LLMs
☆111 Updated 2 weeks ago
FasterDecoding / TEAL
☆108 Updated 4 months ago
feifeibear / ChituAttention
Quantized Attention on GPU
☆34 Updated 2 months ago
htqin / IR-QLoRA
[ICML 2024 Oral] This project is the official implementation of our Accurate LoRA-Finetuning Quantization of LLMs via Information Retenti…
☆60 Updated 9 months ago
Aaronhuang-778 / SliM-LLM
SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models
☆26 Updated 5 months ago
hao-ai-lab / vllm-ltr
[NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank
☆34 Updated 2 months ago
ruikangliu / FlatQuant
Official PyTorch implementation of FlatQuant: Flatness Matters for LLM Quantization
☆95 Updated last week
Hsu1023 / DuQuant
[NeurIPS 2024 Oral🔥] DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs.
☆138 Updated 3 months ago
thu-nics / qllm-eval
Code Repository of Evaluating Quantized Large Language Models
☆114 Updated 4 months ago
Dao-AILab / fast-hadamard-transform
Fast Hadamard transform in CUDA, with a PyTorch interface
☆135 Updated 8 months ago
jundaf2 / INT8-Flash-Attention-FMHA-Quantization
☆157 Updated last year
NVlabs / COAT
☆55 Updated 2 weeks ago
fla-org / flame
🔥 A minimal training framework for scaling FLA models
☆27 Updated this week
PipeFusion / PipeFusion
A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters
☆37 Updated 6 months ago
LiuXiaoxuanPKU / OSD
☆40 Updated last month
thu-ml / 2by4-pretrain-acc-examples
Code for "Accelerating Transformer Pre-training with 2:4 Sparsity"
☆17 Updated last month
cat538 / SKVQ
SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models
☆13 Updated 3 months ago
mit-han-lab / VisCompare
A WebUI for Side-by-Side Comparison of Media (Images/Videos) Across Multiple Folders
☆16 Updated this week
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆61 Updated 2 months ago