saic-fi / MobileQuantLinks

[EMNLP Findings 2024] MobileQuant: Mobile-friendly Quantization for On-device Language Models

☆68

Alternatives and similar repositories for MobileQuant

Users that are interested in MobileQuant are comparing it to the libraries listed below

Sorting:

wejoncy / QLLM
A general 2-8 bits quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, and export to onnx/onnx-runtime easily.
☆182Updated 7 months ago
Macaronlin / LLaMA3-Quantization
A repository dedicated to evaluating the performance of quantizied LLaMA3 using various quantization methods..
☆197Updated 10 months ago
HandH1998 / QQQ
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
☆146Updated 3 months ago
casper-hansen / AutoAWQ_kernels
☆78Updated last year
MollySophia / rwkv-qualcomm
Inference RWKV v5, v6 and v7 with Qualcomm AI Engine Direct SDK
☆87Updated last month
hahnyuan / RPTQ4LLM
Reorder-based post-training quantization for large language model
☆197Updated 2 years ago
AniZpZ / AutoSmoothQuant
An easy-to-use package for implementing SmoothQuant for LLMs
☆109Updated 7 months ago
daquexian / faster-rwkv
☆125Updated last year
powerserve-project / PowerServe
High-speed and easy-use LLM serving framework for local deployment
☆137Updated 3 months ago
vllm-project / compressed-tensors
A safetensors extension to efficiently store sparse quantized tensors on disk
☆210Updated last week
nbasyl / LLM-FP4
The official implementation of the EMNLP 2023 paper LLM-FP4
☆218Updated last year
Adlik / smoothquantplus
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
☆23Updated last year
Aaronhuang-778 / BiLLM
[ICML 2024] BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
☆227Updated 10 months ago
mlc-ai / llm-perf-bench
☆120Updated last year
efeslab / fiddler
[ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration
☆243Updated last year
INT-FlashAttention2024 / INT-FlashAttention
☆83Updated 10 months ago
mlc-ai / relax
☆170Updated 2 weeks ago
Efficient-ML / Qwen3-Quantization
☆63Updated 2 months ago
yuzhenmao / IceFormer
Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024).
☆25Updated 4 months ago
OpenGVLab / EfficientQAT
[ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
☆312Updated 6 months ago
SqueezeBits / QUICK
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
☆118Updated last year
DD-DuDa / BitDistiller
[ACL 2024] A novel QAT with Self-Distillation framework to enhance ultra low-bit LLMs.
☆129Updated last year
GreenBitAI / low_bit_llama
Advanced Ultra-Low Bitrate Compression Techniques for the LLaMA Family of LLMs
☆110Updated last year
jundaf2 / INT8-Flash-Attention-FMHA-Quantization
☆159Updated 2 years ago
microsoft / AttentionEngine
☆111Updated 6 months ago
efeslab / Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆331Updated last year
facebookresearch / ParetoQ
This repository contains the training code of ParetoQ introduced in our work "ParetoQ Scaling Laws in Extremely Low-bit LLM Quantization"
☆113Updated last month
ChenMnZ / PrefixQuant
An algorithm for weight-activation quantization (W4A4, W4A8) of LLMs, supporting both static and dynamic quantization
☆166Updated this week
haochengxi / Train_Transformers_with_INT4
☆157Updated 2 years ago
kyegomez / FlashAttention20
Get down and dirty with FlashAttention2.0 in pytorch, plug in and play no complex CUDA kernels
☆112Updated 2 years ago