haochengxi / Train_Transformers_with_INT4

☆156

Alternatives and similar repositories for Train_Transformers_with_INT4

Users who are interested in Train_Transformers_with_INT4 are comparing it to the libraries listed below.

nbasyl / LLM-FP4
The official implementation of the EMNLP 2023 paper LLM-FP4
☆217 Updated last year
hahnyuan / RPTQ4LLM
Reorder-based post-training quantization for large language models
☆194 Updated 2 years ago
opengear-project / GEAR
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
☆169 Updated last year
IST-DASLab / OBC
Code for the NeurIPS 2022 paper "Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning".
☆129 Updated 2 years ago
efeslab / Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆320 Updated last year
thu-ml / low-bit-optimizers
Low-bit optimizers for PyTorch
☆132 Updated 2 years ago
DD-DuDa / BitDistiller
[ACL 2024] A novel QAT with Self-Distillation framework to enhance ultra low-bit LLMs.
☆122 Updated last year
ChenMnZ / PrefixQuant
An algorithm for weight-activation quantization (W4A4, W4A8) of LLMs, supporting both static and dynamic quantization
☆160 Updated 5 months ago
facebookresearch / LLM-QAT
Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models"
☆317 Updated 7 months ago
FasterDecoding / TEAL
☆145 Updated 8 months ago
NVlabs / COAT
[ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training
☆242 Updated 2 months ago
SqueezeAILab / KVQuant
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
☆389 Updated last year
hahnyuan / PB-LLM
PB-LLM: Partially Binarized Large Language Models
☆156 Updated last year
Macaronlin / LLaMA3-Quantization
A repository dedicated to evaluating the performance of quantized LLaMA3 using various quantization methods.
☆195 Updated 9 months ago
jundaf2 / INT8-Flash-Attention-FMHA-Quantization
☆158 Updated 2 years ago
Aaronhuang-778 / BiLLM
[ICML 2024] BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
☆228 Updated 9 months ago
Guangxuan-Xiao / torch-int
This repository contains integer operators on GPUs for PyTorch.
☆220 Updated 2 years ago
Equationliu / Kangaroo
[NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exitin…
☆60 Updated last year
yxli2123 / LoftQ
☆230 Updated last year
Dao-AILab / grouped-latent-attention
☆130 Updated 4 months ago
pprp / Pruner-Zero
[ICML24] Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for LLMs
☆94 Updated 10 months ago
kyegomez / FlashAttention20
Get down and dirty with FlashAttention2.0 in PyTorch: plug-and-play, no complex CUDA kernels
☆108 Updated 2 years ago
stanford-futuredata / stk
☆112 Updated last year
hahnyuan / ASVD4LLM
Activation-aware Singular Value Decomposition for Compressing Large Language Models
☆80 Updated last year
kyegomez / FlashAttention20Triton
Triton implementation of Flash Attention2.0
☆40 Updated 2 years ago
thu-nics / qllm-eval
Code Repository of Evaluating Quantized Large Language Models
☆132 Updated last year
spcl / QuaRot
Code for the NeurIPS 2024 paper: QuaRot, an end-to-end 4-bit inference of large language models.
☆433 Updated 10 months ago
IST-DASLab / HALO
HALO: Hadamard-Assisted Low-Precision Optimization and Training method for finetuning LLMs. 🚀 The official implementation of https://arx…
☆26 Updated 8 months ago
andy-yang-1 / DoubleSparse
16-fold memory access reduction with nearly no loss
☆105 Updated 6 months ago
AniZpZ / AutoSmoothQuant
An easy-to-use package for implementing SmoothQuant for LLMs
☆107 Updated 6 months ago