IST-DASLab/Sparse-Marlin

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/IST-DASLab/Sparse-Marlin)

IST-DASLab / Sparse-Marlin

Boosting 4-bit inference kernels with 2:4 Sparsity

☆96

Alternatives and similar repositories for Sparse-Marlin

Users that are interested in Sparse-Marlin are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

HandH1998 / QQQ
View on GitHub
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
☆157Aug 21, 2025Updated 10 months ago
IST-DASLab / marlin
View on GitHub
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
☆1,109Sep 4, 2024Updated last year
IST-DASLab / SparseFinetuning
View on GitHub
Repository for Sparse Finetuning of LLMs via modified version of the MosaicML llmfoundry
☆43Jan 15, 2024Updated 2 years ago
jy-yuan / KIVI
View on GitHub
[ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
☆418Nov 20, 2025Updated 7 months ago
LeiWang1999 / Stream-k.tvm
View on GitHub
☆20Sep 28, 2024Updated last year
AI Agents on DigitalOcean Gradient AI Platform • Ad
Build production-ready AI agents using customizable tools or access multiple LLMs through a single endpoint. Create custom knowledge bases or connect external data.
HanGuo97 / flute
View on GitHub
Fast Matrix Multiplications for Lookup Table-Quantized LLMs
☆391Apr 13, 2025Updated last year
efeslab / Atom
View on GitHub
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆343Jul 2, 2024Updated 2 years ago
AlibabaPAI / FLASHNN
View on GitHub
☆106Sep 9, 2024Updated last year
IST-DASLab / Quartet
View on GitHub
☆126Mar 18, 2026Updated 4 months ago
neuralmagic / nm-vllm
View on GitHub
A high-throughput and memory-efficient inference and serving engine for LLMs
☆267Dec 4, 2025Updated 7 months ago
INT-FlashAttention2024 / INT-FlashAttention
View on GitHub
☆91Jan 23, 2025Updated last year
SqueezeBits / QUICK
View on GitHub
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
☆123Mar 6, 2024Updated 2 years ago
dropbox / gemlite
View on GitHub
Fast low-bit matmul kernels in Triton
☆477Updated this week
neuralmagic / AutoFP8
View on GitHub
☆210May 5, 2025Updated last year
Managed hosting for WordPress and PHP on Cloudways • Ad
Managed hosting for WordPress, Magento, Laravel, or PHP apps, on multiple cloud providers. Deploy in minutes on Cloudways by DigitalOcean.
mit-han-lab / omniserve
View on GitHub
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆850Mar 6, 2025Updated last year
IST-DASLab / QuEST
View on GitHub
Work in progress.
☆80Nov 25, 2025Updated 7 months ago
Cornell-RelaxML / qtip
View on GitHub
☆180Jun 22, 2025Updated last year
IST-DASLab / qmoe
View on GitHub
Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models".
☆278Nov 3, 2023Updated 2 years ago
SqueezeAILab / KVQuant
View on GitHub
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
☆430Aug 13, 2024Updated last year
IST-DASLab / qutlass
View on GitHub
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
☆191Updated this week
Bruce-Lee-LY / cuda_hgemv
View on GitHub
Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core.
☆75Sep 8, 2024Updated last year
NetEase-FuXi / EETQ
View on GitHub
Easy and Efficient Quantization for Transformers
☆205Mar 25, 2026Updated 3 months ago
vllm-project / compressed-tensors
View on GitHub
A safetensors extension to efficiently store sparse quantized tensors on disk
☆302Updated this week
Managed hosting for WordPress and PHP on Cloudways • Ad
Managed hosting for WordPress, Magento, Laravel, or PHP apps, on multiple cloud providers. Deploy in minutes on Cloudways by DigitalOcean.
wangsiping97 / FastGEMV
View on GitHub
High-speed GEMV kernels, at most 2.7x speedup compared to pytorch baseline.
☆129Jul 13, 2024Updated 2 years ago
AniZpZ / AutoSmoothQuant
View on GitHub
An easy-to-use package for implementing SmoothQuant for LLMs
☆111Apr 7, 2025Updated last year
linxihui / dkernel
View on GitHub
☆22Apr 17, 2025Updated last year
FasterDecoding / TEAL
View on GitHub
☆167Feb 15, 2025Updated last year
ROCm / tensorcast
View on GitHub
☆18Nov 10, 2025Updated 8 months ago
IST-DASLab / HALO
View on GitHub
HALO: Hadamard-Assisted Low-Precision Optimization and Training method for finetuning LLMs. 🚀 The official implementation of https://arx…
☆31Feb 17, 2025Updated last year
SqueezeAILab / SqueezeLLM
View on GitHub
[ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization
☆722Aug 13, 2024Updated last year
usyd-fsalab / fp6_llm
View on GitHub
An efficient GPU support for LLM inference with x-bit quantization (e.g. FP6,FP5).
☆282Jul 16, 2025Updated last year
Dao-AILab / fast-hadamard-transform
View on GitHub
Fast Hadamard transform in CUDA, with a PyTorch interface
☆338Mar 10, 2026Updated 4 months ago
AI Agents on DigitalOcean Gradient AI Platform • Ad
Build production-ready AI agents using customizable tools or access multiple LLMs through a single endpoint. Create custom knowledge bases or connect external data.
facebookresearch / SpinQuant
View on GitHub
Code repo for the paper "SpinQuant LLM quantization with learned rotations"
☆415Feb 14, 2025Updated last year
spcl / QuaRot
View on GitHub
Code for Neurips24 paper: QuaRot, an end-to-end 4-bit inference of large language models.
☆523Nov 26, 2024Updated last year
WaveSpeedAI / QuantumAttention
View on GitHub
[WIP] Better (FP8) attention for Hopper
☆33Feb 24, 2025Updated last year
Qualcomm-AI-research / lr-qat
View on GitHub
☆54Nov 5, 2024Updated last year
efeslab / Nanoflow
View on GitHub
A throughput-oriented high-performance serving framework for LLMs
☆968Mar 29, 2026Updated 3 months ago
VPeterV / RankSpace-Models
View on GitHub
source code for NAACL2022 main conference "Dynamic Programming in Rank Space: Scaling Structured Inference with Low-Rank HMMs and PCFGs"
☆10Sep 26, 2022Updated 3 years ago
Cornell-RelaxML / quip-sharp
View on GitHub
☆600Oct 29, 2024Updated last year