saic-fi / MobileQuant
[EMNLP Findings 2024] MobileQuant: Mobile-friendly Quantization for On-device Language Models
☆56 · Updated 7 months ago
Alternatives and similar repositories for MobileQuant
Users interested in MobileQuant are comparing it to the libraries listed below.
- This repository is a read-only mirror of https://gitlab.arm.com/kleidi/kleidiai ☆37 · Updated this week
- A general 2–8 bit quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, with easy export to ONNX/ONNX Runtime ☆169 · Updated last month
- Inference of RWKV v5, v6, and v7 with the Qualcomm AI Engine Direct SDK ☆64 · Updated last week
- [ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration ☆209 · Updated 5 months ago
- ☆156 · Updated last month
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆117 · Updated last year
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs ☆121 · Updated last month
- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization ☆106 · Updated 7 months ago
- ☆131 · Updated last month
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆109 · Updated this week
- ☆70 · Updated 3 months ago
- High-speed, easy-to-use LLM serving framework for local deployment ☆103 · Updated last month
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity ☆206 · Updated last year
- Reorder-based post-training quantization for large language models ☆190 · Updated last year
- QuIP quantization ☆52 · Updated last year
- Repository for CPU Kernel Generation for LLM Inference ☆26 · Updated last year
- A toolkit for fine-tuning, inference, and evaluation of GreenBitAI's LLMs ☆84 · Updated 2 months ago
- LLaMA INT4 CUDA inference with AWQ ☆54 · Updated 3 months ago
- An easy-to-use package for implementing SmoothQuant for LLMs ☆97 · Updated last month
- ☆73 · Updated 5 months ago
- An algorithm for weight-activation quantization (W4A4, W4A8) of LLMs, supporting both static and dynamic quantization ☆133 · Updated 3 months ago
- Repository for sparse fine-tuning of LLMs via a modified version of the MosaicML llmfoundry ☆41 · Updated last year
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (a minimal sketch of the core idea follows this list) ☆23 · Updated last year
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆266 · Updated 7 months ago
- KV cache compression for high-throughput LLM inference ☆126 · Updated 3 months ago
- Compression for Foundation Models ☆31 · Updated last month
- High-speed GEMV kernels, up to 2.7× speedup over the PyTorch baseline ☆109 · Updated 10 months ago
- PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation ☆28 · Updated 5 months ago
- ☆119 · Updated last year
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving ☆308 · Updated 10 months ago
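Several of the entries above build on the scale-migration idea popularized by SmoothQuant (ICML 2023): per-channel activation outliers are divided out of the activations and folded into the weights, leaving the matmul result unchanged while making both operands easier to quantize. Below is a minimal, self-contained sketch of that idea; the function and variable names (`smooth_scales`, `act_absmax`, `alpha`) are illustrative assumptions, not the API of any repository listed here.

```python
# Minimal sketch of SmoothQuant-style scale migration for one Linear layer.
# Names are hypothetical; this is not the API of any repo listed above.
import torch

def smooth_scales(act_absmax: torch.Tensor, weight: torch.Tensor,
                  alpha: float = 0.5) -> torch.Tensor:
    """Per-input-channel scales: s_j = max|X_j|^alpha / max|W_j|^(1-alpha)."""
    w_absmax = weight.abs().amax(dim=0)                 # [in_features]
    s = act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)
    return s.clamp(min=1e-5)

torch.manual_seed(0)
linear = torch.nn.Linear(8, 4, bias=False)
x = torch.randn(16, 8)
x[:, 3] *= 20.0                                         # simulate an outlier channel

# Calibration: record per-channel activation maxima, then migrate the
# quantization difficulty from activations into the weights.
act_absmax = x.abs().amax(dim=0)                        # [in_features]
s = smooth_scales(act_absmax, linear.weight)

x_smooth = x / s                                        # flattened activations
w_smooth = linear.weight * s                            # scales folded into weights

# The product is unchanged up to float error, but x_smooth is far easier
# to quantize per-tensor because the outlier channel has been flattened.
assert torch.allclose(x @ linear.weight.T, x_smooth @ w_smooth.T, atol=1e-4)
```

In practice, alpha is tuned per model (the SmoothQuant paper uses 0.5 as the default), and the division by `s` is absorbed into the preceding normalization layer offline rather than applied at runtime.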