ModelTC / llmc

[EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".

☆510

Alternatives and similar repositories for llmc

Users who are interested in llmc are comparing it to the libraries listed below.

mit-han-lab / omniserve
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆714 Updated 4 months ago
hahnyuan / LLM-Viewer
Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline mod…
☆506 Updated 10 months ago
FlagOpen / FlagGems
FlagGems is an operator library for large language models implemented in the Triton Language.
☆617 Updated this week
ruikangliu / FlatQuant
[ICML 2025] Official PyTorch implementation of "FlatQuant: Flatness Matters for LLM Quantization"
☆141 Updated last month
spcl / QuaRot
Code for the NeurIPS'24 paper QuaRot: end-to-end 4-bit inference of large language models.
☆404 Updated 7 months ago
facebookresearch / LLM-QAT
Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models"
☆302 Updated 4 months ago
FlagOpen / FlagScale
FlagScale is a large model toolkit based on open-sourced projects.
☆321 Updated this week
HandH1998 / QQQ
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
☆133 Updated 3 months ago
pprp / Awesome-LLM-Quantization
Awesome list for LLM quantization
☆251 Updated last month
facebookresearch / SpinQuant
Code repo for the paper "SpinQuant: LLM quantization with learned rotations"
☆298 Updated 4 months ago
feifeibear / LLMSpeculativeSampling
Fast inference from large language models via speculative decoding
☆773 Updated 10 months ago
ByteDance-Seed / decoupleQ
A quantization algorithm for LLMs
☆141 Updated last year
FlagOpen / FlagAttention
A collection of memory efficient attention operators implemented in the Triton language.
☆272 Updated last year
AniZpZ / AutoSmoothQuant
An easy-to-use package for implementing SmoothQuant for LLMs
☆102 Updated 3 months ago
OpenPPL / ppl.llm.serving
☆128 Updated 6 months ago
FMInference / DejaVu
☆330 Updated last year
sgl-project / sgl-learning-materials
Materials for learning SGLang
☆475 Updated this week
IST-DASLab / marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
☆855 Updated 10 months ago
66RING / tiny-flash-attention
Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS
☆380 Updated last month
madsys-dev / deepseekv2-profile
☆142 Updated 4 months ago
harleyszhang / llm_counts
LLM theoretical performance analysis tools, supporting params, FLOPs, memory, and latency analysis.
☆97 Updated 3 weeks ago
SafeAILab / EAGLE
Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3.
☆1,384 Updated this week
OpenPPL / ppl.nn.llm
☆139 Updated last year
thu-nics / qllm-eval
Code Repository of Evaluating Quantized Large Language Models
☆129 Updated 10 months ago
Guangxuan-Xiao / torch-int
This repository contains integer operators on GPUs for PyTorch.
☆206 Updated last year
FMInference / H2O
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.
☆459 Updated 11 months ago
efeslab / Nanoflow
A throughput-oriented high-performance serving framework for LLMs
☆835 Updated last month
LLMServe / DistServe
Disaggregated serving system for Large Language Models (LLMs).
☆639 Updated 3 months ago
efeslab / Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆313 Updated last year
luchangli03 / export_llama_to_onnx
Export LLaMA to ONNX
☆128 Updated 6 months ago