Anonymous1252022 / fp4-all-the-way
☆44 · Updated 7 months ago
Alternatives and similar repositories for fp4-all-the-way
Users interested in fp4-all-the-way are comparing it to the libraries listed below.
- Work in progress · ☆77 · Updated last month
- The evaluation framework for training-free sparse attention in LLMs · ☆108 · Updated 3 months ago
- Official implementation for "Training LLMs with MXFP4" · ☆116 · Updated 8 months ago
- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization · ☆111 · Updated last year
- An extension of the GaLore paper, performing Natural Gradient Descent in a low-rank subspace · ☆18 · Updated last year
- Boosting 4-bit inference kernels with 2:4 sparsity · ☆90 · Updated last year
- KV cache compression for high-throughput LLM inference · ☆148 · Updated 11 months ago
- QuIP quantization · ☆61 · Updated last year
- Official code for the paper "Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark" · ☆27 · Updated 6 months ago
- QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning · ☆159 · Updated 2 months ago
- An algorithm for weight-activation quantization (W4A4, W4A8) of LLMs, supporting both static and dynamic quantization · ☆171 · Updated last month
- [CoLM'25] The official implementation of the paper "MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression" · ☆153 · Updated last month
- An efficient implementation of the NSA (Native Sparse Attention) kernel · ☆128 · Updated 6 months ago
- Official PyTorch implementation of "GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance" (ICML 2025) · ☆50 · Updated 6 months ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer · ☆225 · Updated 7 months ago
- Code for data-aware compression of DeepSeek models · ☆68 · Updated last month
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM · ☆175 · Updated last year
- LLM Inference with Microscaling Format · ☆34 · Updated last year
- [ICML 2025] SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models · ☆47 · Updated last year
- Code for the paper “Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling” · ☆110 · Updated this week
- Training-free Post-training Efficient Sub-quadratic Complexity Attention, implemented with OpenAI Triton · ☆147 · Updated 2 months ago
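
Many of the entries above revolve around block-scaled low-bit formats (MXFP4, NVFP4, microscaling). As a rough illustration of the shared idea, here is a minimal NumPy sketch of block-wise FP4 (E2M1) fake quantization. Everything in it (function names, the 32-element block size, the unconstrained float scale) is an assumption for illustration, not the API of fp4-all-the-way or of any repository listed above.

```python
import numpy as np

# The eight non-negative magnitudes representable in FP4 E2M1
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_blockwise(x: np.ndarray, block_size: int = 32):
    """Fake-quantize a 1-D tensor to FP4 with one shared scale per block.

    Illustrative only: real MX formats constrain the shared scale to a
    power of two (E8M0); an unconstrained float scale is used here.
    """
    assert x.ndim == 1 and x.size % block_size == 0
    blocks = x.reshape(-1, block_size)
    # Pick each block's scale so its absmax maps onto the largest FP4 code (6.0).
    scales = np.abs(blocks).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scales[scales == 0] = 1.0  # all-zero block: any scale works
    scaled = blocks / scales
    # Round each magnitude to the nearest point on the FP4 grid
    # (sign handled separately; real hardware rounds to nearest even).
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    codes = np.sign(scaled) * FP4_GRID[idx]
    return codes, scales

def dequantize_fp4_blockwise(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct an approximation of the original tensor."""
    return (codes * scales).reshape(-1)

x = np.random.randn(128).astype(np.float32)
codes, scales = quantize_fp4_blockwise(x)
x_hat = dequantize_fp4_blockwise(codes, scales)
print("max abs error:", np.abs(x - x_hat).max())
```

The listed formats differ mainly in how that shared scale is encoded: MXFP4 stores one power-of-two (E8M0) scale per 32-element block, while NVFP4 uses an FP8 (E4M3) scale per 16-element block.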