enyac-group / Quamba
The official repository of Quamba1 [ICLR 2025] & Quamba2 [ICML 2025]
★66 · Updated 6 months ago
Alternatives and similar repositories for Quamba
Users interested in Quamba are comparing it to the libraries listed below.
- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization ★111 · Updated last year
- HALO: Hadamard-Assisted Low-Precision Optimization and Training method for finetuning LLMs. The official implementation of https://arx… ★29 · Updated 10 months ago
- LLM Inference with Microscaling Format ★34 · Updated last year
- ★85 · Updated 11 months ago
- ★31 · Updated last year
- ★157 · Updated 10 months ago
- ★40 · Updated last year
- [ACL 2025] Squeezed Attention: Accelerating Long Prompt LLM Inference ★55 · Updated last year
- ★60 · Updated last year
- [ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection ★151 · Updated 10 months ago
- [EMNLP 2024] RoLoRA: Fine-tuning Rotated Outlier-free LLMs for Effective Weight-Activation Quantization ★38 · Updated last year
- Code for paper: [ICLR 2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ★160 · Updated 2 months ago
- [COLM 2025] Official PyTorch implementation of "Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models" ★64 · Updated 6 months ago
- A framework to compare low-bit integer and floating-point formats ★54 · Updated 2 months ago
- ★133 · Updated 7 months ago
- An algorithm for weight-activation quantization (W4A4, W4A8) of LLMs, supporting both static and dynamic quantization ★170 · Updated last month
- Official implementation for Training LLMs with MXFP4 ★116 · Updated 8 months ago
- ★44 · Updated 7 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ★175 · Updated last year
- The evaluation framework for training-free sparse attention in LLMs ★108 · Updated 2 months ago
- This repository contains the training code of ParetoQ introduced in our work "ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization" ★116 · Updated 2 months ago
- xKV: Cross-Layer SVD for KV-Cache Compression ★43 · Updated last month
- Fast and memory-efficient exact attention ★74 · Updated 10 months ago
- [ICML 2025] XAttention: Block Sparse Attention with Antidiagonal Scoring ★263 · Updated 6 months ago
- [ICML 2024] Official Implementation of SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks ★38 · Updated 11 months ago
- Fast Hadamard transform in CUDA, with a PyTorch interface (see the sketch after this list) ★271 · Updated 2 months ago
- [CoLM'25] The official implementation of the paper <MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression> ★153 · Updated last month
- Transformers components but in Triton ★34 · Updated 8 months ago
- 16-fold memory access reduction with nearly no loss ★109 · Updated 9 months ago
- Vortex: A Flexible and Efficient Sparse Attention Framework ★43 · Updated last month
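Several of the repositories above (Quamba2, HALO, RoLoRA, and the fast Hadamard transform kernel) rely on Hadamard rotations to spread activation outliers before low-bit quantization. For context only, here is a minimal pure-PyTorch sketch of an orthonormal fast Walsh-Hadamard transform; it is an illustrative reimplementation, not the API of any repository listed above, and it assumes the rotated dimension is a power of two.

```python
import torch

def fwht(x: torch.Tensor) -> torch.Tensor:
    """Orthonormal fast Walsh-Hadamard transform along the last dimension.

    Equivalent to multiplying by a Hadamard matrix H with entries +-1/sqrt(n),
    so H @ H.T == I. The last dimension must be a power of two.
    """
    n = x.shape[-1]
    assert n & (n - 1) == 0, "last dim must be a power of two"
    y = x.clone()
    h = 1
    while h < n:
        # Pair up blocks of length h and replace them with (a + b, a - b).
        y = y.reshape(*x.shape[:-1], n // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        y = torch.stack((a + b, a - b), dim=-2).reshape(*x.shape[:-1], n)
        h *= 2
    return y / n ** 0.5  # single 1/sqrt(n) scaling makes the transform orthonormal

# Example: rotating a weight matrix along its input dimension before quantization
# flattens outlier channels while the rotation stays exactly invertible (W == fwht(fwht(W))).
W = torch.randn(256, 1024)
W_rot = fwht(W)
```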