IST-DASLab/MoE-Quant

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/IST-DASLab/MoE-Quant)

IST-DASLab / MoE-Quant

Code for data-aware compression of DeepSeek models

☆75

Alternatives and similar repositories for MoE-Quant

Users that are interested in MoE-Quant are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

IST-DASLab / qutlass
View on GitHub
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
☆191Updated this week
IST-DASLab / FP-Quant
View on GitHub
☆114Feb 26, 2026Updated 4 months ago
mit-han-lab / fouroversix
View on GitHub
Code for the papers: “Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling” and “Adaptive Block-Scaled Data Types”
☆198Apr 21, 2026Updated 3 months ago
IST-DASLab / QuEST
View on GitHub
Work in progress.
☆80Nov 25, 2025Updated 7 months ago
IST-DASLab / Quartet
View on GitHub
☆127Mar 18, 2026Updated 4 months ago
GPU virtual machines on DigitalOcean Gradient AI • Ad
Get to production fast with high-performance AMD and NVIDIA GPUs you can spin up in seconds. The definition of operational simplicity.
HandH1998 / QQQ
View on GitHub
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
☆157Aug 21, 2025Updated 11 months ago
cat538 / MxMoE
View on GitHub
[ICML 2025] MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design
☆30Jul 4, 2025Updated last year
facebookresearch / SpinQuant
View on GitHub
Code repo for the paper "SpinQuant LLM quantization with learned rotations"
☆415Feb 14, 2025Updated last year
Intelligent-Computing-Lab-Panda / GPTAQ
View on GitHub
Code implementation of GPTAQ (https://arxiv.org/abs/2504.02692)
☆92Jul 28, 2025Updated 11 months ago
vllm-project / compressed-tensors
View on GitHub
A safetensors extension to efficiently store sparse quantized tensors on disk
☆302Updated this week
UNITES-Lab / MoE-Quantization
View on GitHub
Official code for the paper "Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark"
☆30Jun 30, 2025Updated last year
ChenMnZ / INT_vs_FP
View on GitHub
[ICML 2026]A framework to compare low-bit integer and float-point formats
☆81May 6, 2026Updated 2 months ago
IST-DASLab / sparseprop
View on GitHub
☆16Sep 27, 2023Updated 2 years ago
INT-FlashAttention2024 / INT-FlashAttention
View on GitHub
☆91Jan 23, 2025Updated last year
AI Agents on DigitalOcean Gradient AI Platform • Ad
Build production-ready AI agents using customizable tools or access multiple LLMs through a single endpoint. Create custom knowledge bases or connect external data.
chenzx921020 / MoEQuant
View on GitHub
☆17Apr 7, 2025Updated last year
IST-DASLab / gptq-gguf-toolkit
View on GitHub
Efficient non-uniform quantization with GPTQ for GGUF
☆64Sep 17, 2025Updated 10 months ago
flashinfer-ai / cutlass-viz
View on GitHub
☆65Apr 26, 2025Updated last year
goodevening13 / aquakv
View on GitHub
☆21Apr 27, 2026Updated 2 months ago
ModelCloud / GPTQModel
View on GitHub
LLM model quantization (compression) toolkit with HW acceleration support for Nvidia, AMD, Intel GPU and Intel/AMD/Apple CPU via HF, vLLM…
☆1,210Updated this week
IST-DASLab / spdy
View on GitHub
Code for ICML 2022 paper "SPDY: Accurate Pruning with Speedup Guarantees"
☆20May 3, 2023Updated 3 years ago
Aaronhuang-778 / Mixture-Compressor-MoE
View on GitHub
[ICLR 2025, IEEE TPAMI 2026] Mixture Compressor & MC#
☆75Feb 12, 2025Updated last year
chenyu-jiang / dcp
View on GitHub
Code repository for the SOSP'25 paper DCP: Addressing Input Dynamism In Long-Context Training via Dynamic Context Parallelism.
☆21Nov 28, 2025Updated 7 months ago
facebookresearch / ParetoQ
View on GitHub
This repository contains the training code of ParetoQ introduced in our work "ParetoQ Scaling Laws in Extremely Low-bit LLM Quantization"
☆131Oct 15, 2025Updated 9 months ago
Virtual machines for every use case on DigitalOcean • Ad
Get dependable uptime with 99.99% SLA, simple security tools, and predictable monthly pricing with DigitalOcean's virtual machines, called Droplets.
Dao-AILab / fast-hadamard-transform
View on GitHub
Fast Hadamard transform in CUDA, with a PyTorch interface
☆340Mar 10, 2026Updated 4 months ago
IST-DASLab / llmq
View on GitHub
Quantized LLM training in pure CUDA/C++.
☆250Updated this week
sustcsonglin / fla-tilelang
View on GitHub
☆37Mar 7, 2025Updated last year
IST-DASLab / marlin
View on GitHub
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
☆1,109Sep 4, 2024Updated last year
aiha-lab / MX-QLLM
View on GitHub
LLM Inference with Microscaling Format
☆35Nov 12, 2024Updated last year
sgl-project / sgl-flash-attn
View on GitHub
Fast and memory-efficient exact attention
☆22Jun 26, 2026Updated 3 weeks ago
ruikangliu / FlatQuant
View on GitHub
[ICML 2025] Official PyTorch implementation of "FlatQuant: Flatness Matters for LLM Quantization"
☆223Nov 25, 2025Updated 7 months ago
Supercomputing-System-AI-Lab / MiLo
View on GitHub
Code repo for efficient quantized MoE inference with mixture of low-rank compensators
☆39Apr 14, 2025Updated last year
IST-DASLab / RoSA
View on GitHub
Official implementation of the ICML 2024 paper RoSA (Robust Adaptation)
☆46May 20, 2026Updated 2 months ago
Proton VPN Special Offer - Get 70% off • Ad
Special partner offer. Trusted by over 100 million users worldwide. Tested, Approved and Recommended by Experts.
IST-DASLab / QUIK
View on GitHub
Repository for the QUIK project, enabling the use of 4bit kernels for generative inference - EMNLP 2024
☆185Apr 16, 2024Updated 2 years ago
zhangsichengsjtu / AFPQ
View on GitHub
AFPQ code implementation
☆23Nov 6, 2023Updated 2 years ago
ByteDance-Seed / FlexPrefill
View on GitHub
Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
☆170Oct 13, 2025Updated 9 months ago
ModelTC / Outlier_Suppression_Plus
View on GitHub
Official implementation of the EMNLP23 paper: Outlier Suppression+: Accurate quantization of large language models by equivalent and opti…
☆52Oct 21, 2023Updated 2 years ago
flashinfer-ai / cubloaty
View on GitHub
a size profiler for cuda binary
☆71Jan 15, 2026Updated 6 months ago
luliyucoordinate / cute-flash-attention
View on GitHub
Implement Flash Attention using Cute.
☆108Dec 17, 2024Updated last year
CerebrasResearch / reap
View on GitHub
REAP: Router-weighted Expert Activation Pruning for SMoE compression
☆445Apr 17, 2026Updated 3 months ago