z-lab/paroquant

z-lab / paroquant

[ICLR 2026] ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference

☆325

Alternatives and similar repositories for paroquant

Users that are interested in paroquant are comparing it to the libraries listed below.

z-lab / dflash
DFlash: Block Diffusion for Flash Speculative Decoding
☆5,500 · May 10, 2026 · Updated 2 months ago
mit-han-lab / fouroversix
Code for the papers: “Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling” and “Adaptive Block-Scaled Data Types”
☆198 · Apr 21, 2026 · Updated 2 months ago
z-lab / flash-colreduce
Fast, memory-efficient attention column reduction (e.g., sum, mean, max)
☆49 · Feb 10, 2026 · Updated 5 months ago
z-lab / sparselora
[ICML 2025] SparseLoRA: Accelerating LLM Fine-Tuning with Contextual Sparsity
☆76 · Mar 10, 2026 · Updated 4 months ago
Goekdeniz-Guelmez / MLX-Benchmark
The best benchmark for LLMs on Apple's MLX framework knowledge and coding tasks.
☆37 · Jun 12, 2026 · Updated last month
WeianMao / triattention
TriAttention — Efficient long reasoning with trigonometric KV cache compression. Enables OpenClaw local deployment on memory-constrained …
☆826 · Jul 14, 2026 · Updated last week
liranringel / ddtree
☆389 · Apr 16, 2026 · Updated 3 months ago
Aaronhuang-778 / SliM-LLM
[ICML 2025] SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models
☆62 · Aug 9, 2024 · Updated last year
awni / mlx_nanocode
Minimal Claude Code alternative powered by MLX
☆47 · Jan 11, 2026 · Updated 6 months ago
0xClandestine / mirror-sd
DFlash block-diffusion speculative decoding running on Apple Silicon via MLX, with an ANE execution path that explores heterogeneous acce…
☆62 · Apr 18, 2026 · Updated 3 months ago
utkarsh-dmx / project-resq
☆35 · Mar 28, 2025 · Updated last year
intel / auto-round
A SOTA quantization algorithm for high-accuracy low-bit LLM inference, seamlessly optimized for CPU/XPU/CUDA, with multi-datatype support…
☆1,530 · Updated this week
Cornell-RelaxML / yaqa-quantization
☆84 · Jun 20, 2025 · Updated last year
mit-han-lab / flash-moba
☆250 · Nov 19, 2025 · Updated 8 months ago
IST-DASLab / FP-Quant
☆114 · Feb 26, 2026 · Updated 4 months ago
BrotherHappy / OSTQuant
[ICLR2025]: OSTQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitt…
☆93 · Apr 8, 2025 · Updated last year
AutoLab-SAI-SJTU / QVLA
[ICLR'26]QVLA
☆45 · Feb 4, 2026 · Updated 5 months ago
zhangsichengsjtu / AFPQ
AFPQ code implementation
☆23 · Nov 6, 2023 · Updated 2 years ago
mit-han-lab / omniserve
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆850 · Mar 6, 2025 · Updated last year
mit-han-lab / fastrl
[ASPLOS'26] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter
☆174 · Feb 27, 2026 · Updated 4 months ago
vincentamato / mlx-esm-2
An MLX implementation of Meta AI's ESM-2 protein language model
☆16 · Aug 16, 2025 · Updated 11 months ago
ModelCloud / GPTQModel
LLM model quantization (compression) toolkit with HW acceleration support for Nvidia, AMD, Intel GPU and Intel/AMD/Apple CPU via HF, vLLM…
☆1,210 · Updated this week
Aryagm / dflash-mlx
Exact speculative decoding on Apple Silicon, powered by MLX.
☆380 · Apr 20, 2026 · Updated 3 months ago
avbiswas / mlx-vlm-batch-outlines
Minimalist repo to do mlx vlm constrained decoding in batch mode
☆18 · Apr 11, 2026 · Updated 3 months ago
nunchaku-ai / deepcompressor
Model Compression Toolbox for Large Language Models and Diffusion Models
☆794 · Aug 14, 2025 · Updated 11 months ago
computer-graphics-tools / ane
☆42 · Mar 5, 2026 · Updated 4 months ago
ruikangliu / FlatQuant
[ICML 2025] Official PyTorch implementation of "FlatQuant: Flatness Matters for LLM Quantization"
☆223 · Nov 25, 2025 · Updated 7 months ago
Cornell-RelaxML / qtip
☆180 · Jun 22, 2025 · Updated last year
Kai-Liu001 / Awesome-Model-Quantization
This repository contains low-bit quantization papers from 2020 to 2026 on top conference.
☆192 · Jun 25, 2026 · Updated 3 weeks ago
mit-han-lab / lpd
[ICLR 2026 Oral] Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation
☆104 · May 8, 2026 · Updated 2 months ago
ChenMnZ / PrefixQuant
An algorithm for weight-activation quantization (W4A4, W4A8) of LLMs, supporting both static and dynamic quantization
☆176 · Nov 26, 2025 · Updated 7 months ago
CerebrasResearch / reap
REAP: Router-weighted Expert Activation Pruning for SMoE compression
☆445 · Apr 17, 2026 · Updated 3 months ago
AMD-AGI / PARD
PARD: Accelerating LLM Inference with Low-Cost PARallel Draft Model Adaptation (ICLR 26)
☆33 · Jun 10, 2026 · Updated last month
dlwns147 / amq
[EMNLP 2025] AMQ: Enabling AutoML for Mixed-precision Weight-Only Quantization of Large Language Models
☆16 · Apr 29, 2026 · Updated 2 months ago
Anemll / anemll-flash-llama.cpp
Flash-MoE sidecar slot-bank runtime for large GGUF MoE models on Apple Silicon — llama.cpp fork
☆118 · Updated this week
facebookresearch / ParetoQ
This repository contains the training code of ParetoQ introduced in our work "ParetoQ Scaling Laws in Extremely Low-bit LLM Quantization"
☆131 · Oct 15, 2025 · Updated 9 months ago
mit-han-lab / kernel-design-agents
☆754 · Jun 2, 2026 · Updated last month
Goekdeniz-Guelmez / mlx-lm-lens
Find the hidden meaning of LLMs
☆41 · Nov 13, 2025 · Updated 8 months ago
Intelligent-Computing-Lab-Panda / GPTAQ
Code implementation of GPTAQ (https://arxiv.org/abs/2504.02692)
☆92 · Jul 28, 2025 · Updated 11 months ago