IST-DASLab / QuartetLinks

☆102

Alternatives and similar repositories for Quartet

Users that are interested in Quartet are comparing it to the libraries listed below

Sorting:

Cornell-RelaxML / qtip
☆152Updated 4 months ago
IST-DASLab / QuEST
Work in progress.
☆74Updated 3 months ago
Anonymous1252022 / fp4-all-the-way
☆35Updated 5 months ago
chu-tianxiang / QuIP-for-all
QuIP quantization
☆59Updated last year
amazon-science / mxfp4-llm
Official implementation for Training LLMs with MXFP4
☆100Updated 6 months ago
IsaacRe / vllm-kvcompress
KV cache compression for high-throughput LLM inference
☆142Updated 8 months ago
hahnyuan / PB-LLM
PB-LLM: Partially Binarized Large Language Models
☆156Updated last year
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆83Updated last year
GATECH-EIC / ShiftAddLLM
ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization
☆110Updated last year
gau-nernst / quantized-training
Explore training for quantized models
☆25Updated 3 months ago
IST-DASLab / qutlass
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
☆119Updated 3 weeks ago
vllm-project / compressed-tensors
A safetensors extension to efficiently store sparse quantized tensors on disk
☆180Updated this week
snu-mllab / GuidedQuant
Official PyTorch implementation of "GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance" (ICML 2025)
☆45Updated 3 months ago
Cornell-RelaxML / yaqa-quantization
☆60Updated 4 months ago
PiotrNawrot / sparse-frontier
The evaluation framework for training-free sparse attention in LLMs
☆101Updated last week
FasterDecoding / TEAL
☆145Updated 8 months ago
IST-DASLab / SparseFinetuning
Repository for Sparse Finetuning of LLMs via modified version of the MosaicML llmfoundry
☆42Updated last year
IST-DASLab / MoE-Quant
Code for data-aware compression of DeepSeek models
☆56Updated 4 months ago
Dao-AILab / grouped-latent-attention
☆130Updated 4 months ago
DeepAuto-AI / hip-attention
Training-free Post-training Efficient Sub-quadratic Complexity Attention. Implemented with OpenAI Triton.
☆147Updated last week
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆130Updated 10 months ago
SqueezeBits / QUICK
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
☆118Updated last year
Zyphra / tree_attention
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters
☆130Updated 10 months ago
IST-DASLab / MicroAdam
This repository contains code for the MicroAdam paper.
☆19Updated 10 months ago
INT-FlashAttention2024 / INT-FlashAttention
☆82Updated 9 months ago
samchaineau / llm_slerp_generation
Repo hosting codes and materials related to speeding LLMs' inference using token merging.
☆36Updated 2 weeks ago
nil0x9 / flash-muon
Flash-Muon: An Efficient Implementation of Muon Optimizer
☆195Updated 4 months ago
IST-DASLab / QIGen
Repository for CPU Kernel Generation for LLM Inference
☆26Updated 2 years ago
efeslab / fiddler
[ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration
☆238Updated 11 months ago
facebookresearch / ParetoQ
This repository contains the training code of ParetoQ introduced in our work "ParetoQ Scaling Laws in Extremely Low-bit LLM Quantization"
☆108Updated last week