IST-DASLab/qmoe

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/IST-DASLab/qmoe)

IST-DASLab / qmoe

Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models".

☆278

Alternatives and similar repositories for qmoe

Users that are interested in qmoe are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

SqueezeAILab / SqueezeLLM
View on GitHub
[ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization
☆722Aug 13, 2024Updated last year
HanGuo97 / lq-lora
View on GitHub
☆129Jan 22, 2024Updated 2 years ago
hahnyuan / PB-LLM
View on GitHub
PB-LLM: Partially Binarized Large Language Models
☆158Nov 20, 2023Updated 2 years ago
IST-DASLab / Sparse-Marlin
View on GitHub
Boosting 4-bit inference kernels with 2:4 Sparsity
☆96Sep 4, 2024Updated last year
IST-DASLab / gptq
View on GitHub
Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
☆2,333Mar 27, 2024Updated 2 years ago
Wordpress hosting with auto-scaling - Free Trial Offer • Ad
Fully Managed hosting for WordPress and WooCommerce businesses that need reliable, auto-scalable performance. Cloudways SafeUpdates now available.
jy-yuan / KIVI
View on GitHub
[ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
☆418Nov 20, 2025Updated 8 months ago
dropbox / hqq
View on GitHub
Official implementation of Half-Quadratic Quantization (HQQ)
☆949Feb 26, 2026Updated 4 months ago
IST-DASLab / marlin
View on GitHub
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
☆1,109Sep 4, 2024Updated last year
FasterDecoding / SnapKV
View on GitHub
☆324Jul 10, 2025Updated last year
IBM / ModuleFormer
View on GitHub
ModuleFormer is a MoE-based architecture that includes two different types of experts: stick-breaking attention heads and feedforward exp…
☆225Sep 18, 2025Updated 10 months ago
IST-DASLab / sparsegpt
View on GitHub
Code for the ICML 2023 paper "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot".
☆889Aug 20, 2024Updated last year
Cornell-RelaxML / QuIP
View on GitHub
Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees"
☆399Feb 24, 2024Updated 2 years ago
Cornell-RelaxML / quip-sharp
View on GitHub
☆600Oct 29, 2024Updated last year
Azure / MS-AMP
View on GitHub
Microsoft Automatic Mixed Precision Library
☆636Dec 1, 2025Updated 7 months ago
Bare Metal GPUs on DigitalOcean Gradient AI • Ad
Purpose-built for serious AI teams training foundational models, running large-scale inference, and pushing the boundaries of what's possible.
schwartz-lab-NLP / TOVA
View on GitHub
Token Omission Via Attention
☆131Oct 13, 2024Updated last year
IST-DASLab / QUIK
View on GitHub
Repository for the QUIK project, enabling the use of 4bit kernels for generative inference - EMNLP 2024
☆185Apr 16, 2024Updated 2 years ago
HanGuo97 / flute
View on GitHub
Fast Matrix Multiplications for Lookup Table-Quantized LLMs
☆391Apr 13, 2025Updated last year
yuhuixu1993 / qa-lora
View on GitHub
Official PyTorch implementation of QA-LoRA
☆147Mar 13, 2024Updated 2 years ago
IST-DASLab / RoSA
View on GitHub
Official implementation of the ICML 2024 paper RoSA (Robust Adaptation)
☆46May 20, 2026Updated 2 months ago
IST-DASLab / SparseFinetuning
View on GitHub
Repository for Sparse Finetuning of LLMs via modified version of the MosaicML llmfoundry
☆43Jan 15, 2024Updated 2 years ago
facebookresearch / Ternary_Binary_Transformer
View on GitHub
ACL 2023
☆39Jun 6, 2023Updated 3 years ago
RobertCsordas / moe
View on GitHub
Official repository for the paper "Approximating Two-Layer Feedforward Networks for Efficient Transformers"
☆39Jun 11, 2025Updated last year
Vahe1994 / SpQR
View on GitHub
☆554Feb 8, 2026Updated 5 months ago
Managed hosting for WordPress and PHP on Cloudways • Ad
Managed hosting for WordPress, Magento, Laravel, or PHP apps, on multiple cloud providers. Deploy in minutes on Cloudways by DigitalOcean.
fishiatee / Tumera
View on GitHub
Yet another frontend for LLM, written using .NET and WinUI 3
☆11Sep 14, 2025Updated 10 months ago
OpenGVLab / OmniQuant
View on GitHub
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
☆901Nov 26, 2025Updated 7 months ago
IST-DASLab / torch_cgx
View on GitHub
Pytorch distributed backend extension with compression support
☆17Mar 24, 2025Updated last year
IST-DASLab / sparseprop
View on GitHub
☆16Sep 27, 2023Updated 2 years ago
IST-DASLab / OBC
View on GitHub
Code for the NeurIPS 2022 paper "Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning".
☆132Jul 11, 2023Updated 3 years ago
datamllab / LongLM
View on GitHub
[ICML'24 Spotlight] LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning
☆668Jun 1, 2024Updated 2 years ago
XueFuzhao / OpenMoE
View on GitHub
A family of open-sourced Mixture-of-Experts (MoE) Large Language Models
☆1,691Mar 8, 2024Updated 2 years ago
hao-ai-lab / LookaheadDecoding
View on GitHub
[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
☆1,340Mar 6, 2025Updated last year
efeslab / Atom
View on GitHub
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆343Jul 2, 2024Updated 2 years ago
Managed hosting for WordPress and PHP on Cloudways • Ad
Managed hosting for WordPress, Magento, Laravel, or PHP apps, on multiple cloud providers. Deploy in minutes on Cloudways by DigitalOcean.
SafeAILab / EAGLE
View on GitHub
Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25).
☆2,468Feb 20, 2026Updated 5 months ago
FasterDecoding / TEAL
View on GitHub
☆167Feb 15, 2025Updated last year
xuyuzhuang11 / OneBit
View on GitHub
The homepage of OneBit model quantization framework.
☆205Feb 5, 2025Updated last year
microsoft / TransformerCompression
View on GitHub
For releasing code related to compression methods for transformers, accompanying our publications
☆461Jan 16, 2025Updated last year
nbasyl / LLM-FP4
View on GitHub
The official implementation of the EMNLP 2023 paper LLM-FP4
☆224Dec 15, 2023Updated 2 years ago
xjdr-alt / mla_blog_translation
View on GitHub
☆13Jun 18, 2024Updated 2 years ago
punica-ai / punica
View on GitHub
Serving multiple LoRA finetuned LLM as one
☆1,166May 8, 2024Updated 2 years ago