usyd-fsalab / fp6_llm

Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).

☆266

Alternatives and similar repositories for fp6_llm

Users who are interested in fp6_llm are comparing it to the libraries listed below.

efeslab / Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆320 Updated last year
wangsiping97 / FastGEMV
High-speed GEMV kernels, with up to 2.7x speedup compared to the PyTorch baseline.
☆117 Updated last year
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆125 Updated 4 months ago
Dao-AILab / fast-hadamard-transform
Fast Hadamard transform in CUDA, with a PyTorch interface
☆253 Updated last week
ColfaxResearch / cutlass-kernels
☆241 Updated last year
AlibabaResearch / flash-llm
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
☆221 Updated 2 years ago
Guangxuan-Xiao / torch-int
This repository contains integer operators on GPUs for PyTorch.
☆220 Updated 2 years ago
FlagOpen / FlagAttention
A collection of memory efficient attention operators implemented in the Triton language.
☆282 Updated last year
HandH1998 / QQQ
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
☆144 Updated 2 months ago
dropbox / gemlite
Fast low-bit matmul kernels in Triton
☆385 Updated last week
AlibabaPAI / FLASHNN
☆100 Updated last year
microsoft / SparTA
☆153 Updated last year
meta-pytorch / applied-ai
Applied AI experiments and examples for PyTorch
☆299 Updated 2 months ago
meta-pytorch / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆264 Updated this week
microsoft / BitBLAS
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
☆703 Updated 2 months ago
SqueezeBits / QUICK
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
☆118 Updated last year
FMInference / DejaVu
☆343 Updated last year
SqueezeAILab / KVQuant
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
☆389 Updated last year
IST-DASLab / qutlass
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
☆119 Updated 3 weeks ago
AniZpZ / AutoSmoothQuant
An easy-to-use package for implementing SmoothQuant for LLMs
☆107 Updated 6 months ago
yifuwang / symm-mem-recipes
☆141 Updated 9 months ago
InternLM / turbomind
☆97 Updated 7 months ago
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆181 Updated 2 weeks ago
gpu-mode / triton-index
Cataloging released Triton kernels.
☆263 Updated last month
Bruce-Lee-LY / flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆41 Updated 8 months ago
NVIDIA / online-softmax
Benchmark code for the "Online normalizer calculation for softmax" paper
☆102 Updated 7 years ago
neuralmagic / AutoFP8
☆205 Updated 5 months ago
madsys-dev / deepseekv2-profile
☆148 Updated 7 months ago
fanshiqing / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆156 Updated 2 weeks ago
INT-FlashAttention2024 / INT-FlashAttention
☆82 Updated 9 months ago