enyac-group / Quamba

The official repository of Quamba1 [ICLR 2025] & Quamba2 [ICML 2025]

☆51

Alternatives and similar repositories for Quamba

Users who are interested in Quamba are comparing it to the repositories listed below.

ScalingIntelligence / CATS
☆26 Updated 8 months ago
shadowpa0327 / Palu
[ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection
☆125 Updated 4 months ago
FasterDecoding / TEAL
☆136 Updated 5 months ago
GATECH-EIC / ShiftAddLLM
ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization
☆109 Updated 9 months ago
SqueezeAILab / SqueezedAttention
Squeezed Attention: Accelerating Long Prompt LLM Inference
☆49 Updated 7 months ago
Dao-AILab / grouped-latent-attention
☆116 Updated last month
aiha-lab / MX-QLLM
LLM Inference with Microscaling Format
☆24 Updated 8 months ago
ruikangliu / Quantized-Reasoning-Models
[COLM 2025] Official PyTorch implementation of "Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models"
☆39 Updated last week
facebookresearch / ParetoQ
This repository contains the training code of ParetoQ, introduced in our work "ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization"
☆85 Updated last month
HuangOwen / RoLoRA
[EMNLP 2024] RoLoRA: Fine-tuning Rotated Outlier-free LLMs for Effective Weight-Activation Quantization
☆37 Updated 9 months ago
GATECH-EIC / Linearized-LLM
[ICML 2024] When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models
☆31 Updated last year
CASE-Lab-UMD / Unified-MoE-Compression
The official implementation of the paper "Towards Efficient Mixture of Experts: A Holistic Study of Compression Techniques (TMLR)".
☆71 Updated 3 months ago
HanGuo97 / log-linear-attention
☆222 Updated last month
thu-ml / Jetfire-INT8Training
☆52 Updated 11 months ago
htqin / IR-QLoRA
[ICML 2024 Oral] This project is the official implementation of our Accurate LoRA-Finetuning Quantization of LLMs via Information Retenti…
☆65 Updated last year
mit-han-lab / x-attention
[ICML 2025] XAttention: Block Sparse Attention with Antidiagonal Scoring
☆191 Updated last week
thu-nics / MoA
[CoLM'25] The official implementation of the paper "MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression"
☆139 Updated this week
xiayuqing0622 / flex_head_fa
Fast and memory-efficient exact attention
☆68 Updated 4 months ago
PiotrNawrot / sparse-frontier
The evaluation framework for training-free sparse attention in LLMs
☆82 Updated 3 weeks ago
IST-DASLab / QuEST
Work in progress.
☆70 Updated 2 weeks ago
ambisinister / mla-experiments
Experiments on Multi-Head Latent Attention
☆93 Updated 10 months ago
Qualcomm-AI-research / gptvq
☆31 Updated last year
Aaronhuang-778 / SliM-LLM
[ICML 2025] SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models
☆34 Updated 11 months ago
IntelLabs / Hardware-Aware-Automated-Machine-Learning
☆60 Updated 3 weeks ago
mengxiayu / LLMSuperWeight
Code for studying the super weight in LLMs
☆113 Updated 7 months ago
tilde-research / nsa-impl
An efficient implementation of the NSA (Native Sparse Attention) kernel
☆89 Updated 3 weeks ago
zyxxmu / cam
PyTorch implementation of our paper accepted by ICML 2024: "CaM: Cache Merging for Memory-efficient LLMs Inference"
☆41 Updated last year
ByteDance-Seed / FlexPrefill
Code for the paper: [ICLR 2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
☆118 Updated last month
Dao-AILab / fast-hadamard-transform
Fast Hadamard transform in CUDA, with a PyTorch interface
☆201 Updated last year
VITA-Group / llm-kick
[ICLR 2024] Jaiswal, A., Gan, Z., Du, X., Zhang, B., Wang, Z., & Yang, Y. Compressing LLMs: The Truth is Rarely Pure and Never Simple.
☆24 Updated 2 months ago