neuralmagic / AutoFP8

☆205

Alternatives and similar repositories for AutoFP8

Users who are interested in AutoFP8 are comparing it to the libraries listed below.

anyscale / llm-continuous-batching-benchmarks
☆121 Updated last year
HandH1998 / QQQ
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
☆144 Updated 2 months ago
NetEase-FuXi / EETQ
Easy and Efficient Quantization for Transformers
☆202 Updated 3 months ago
usyd-fsalab / fp6_llm
Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5).
☆265 Updated 3 months ago
vllm-project / compressed-tensors
A safetensors extension to efficiently store sparse quantized tensors on disk
☆171 Updated last week
efeslab / Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆320 Updated last year
AniZpZ / AutoSmoothQuant
An easy-to-use package for implementing SmoothQuant for LLMs
☆107 Updated 6 months ago
wejoncy / QLLM
A general 2-8 bits quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, and export to onnx/onnx-runtime easily.
☆180 Updated 6 months ago
SqueezeAILab / KVQuant
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
☆389 Updated last year
InternLM / turbomind
☆96 Updated 6 months ago
neuralmagic / nm-vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆266 Updated last year
nbasyl / LLM-FP4
The official implementation of the EMNLP 2023 paper LLM-FP4
☆217 Updated last year
fpgaminer / GPTQ-triton
GPTQ inference Triton kernel
☆310 Updated 2 years ago
facebookresearch / LLM-QAT
Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models"
☆317 Updated 7 months ago
meta-pytorch / applied-ai
Applied AI experiments and examples for PyTorch
☆299 Updated 2 months ago
sgl-project / SpecForge
Train speculative decoding models effortlessly and port them smoothly to SGLang serving.
☆428 Updated last week
IST-DASLab / marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
☆916 Updated last year
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆124 Updated 4 months ago
inferflow / inferflow
Inferflow is an efficient and highly configurable inference engine for large language models (LLMs).
☆248 Updated last year
RulinShao / LightSeq
Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training
☆216 Updated last year
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆83 Updated last year
fanshiqing / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆154 Updated last week
IsaacRe / vllm-kvcompress
KV cache compression for high-throughput LLM inference
☆142 Updated 8 months ago
mit-han-lab / omniserve
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆766 Updated 7 months ago
mobiusml / gemlite
Fast low-bit matmul kernels in Triton
☆381 Updated 3 weeks ago
foundation-model-stack / foundation-model-stack
🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components.
☆215 Updated this week
madsys-dev / deepseekv2-profile
☆148 Updated 7 months ago
Azure / MS-AMP
Microsoft Automatic Mixed Precision Library
☆626 Updated last year
SqueezeBits / QUICK
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
☆118 Updated last year
triton-inference-server / vllm_backend
☆302 Updated this week