saic-fi / MobileQuant
[EMNLP Findings 2024] MobileQuant: Mobile-friendly Quantization for On-device Language Models
☆56 · Updated 7 months ago
Alternatives and similar repositories for MobileQuant:
Users interested in MobileQuant are comparing it to the repositories listed below
- A toolkit for fine-tuning, inferencing, and evaluating GreenBitAI's LLMs. ☆82 · Updated last month
- A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, and easy export to onnx/onnx-runtime. ☆168 · Updated 3 weeks ago
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆116 · Updated last year
- PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation ☆29 · Updated 5 months ago
- Unofficial implementations of block/layer-wise pruning methods for LLMs. ☆68 · Updated 11 months ago
- ☆118 · Updated last year
- High-speed and easy-to-use LLM serving framework for local deployment ☆99 · Updated last month
- ☆126 · Updated 3 weeks ago
- PB-LLM: Partially Binarized Large Language Models ☆151 · Updated last year
- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization ☆106 · Updated 6 months ago
- [ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration ☆208 · Updated 5 months ago
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity ☆204 · Updated last year
- Inference RWKV v5, v6 and v7 with Qualcomm AI Engine Direct SDK ☆62 · Updated last week
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models ☆22 · Updated last year
- An easy-to-use package for implementing SmoothQuant for LLMs ☆96 · Updated 2 weeks ago
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs. ☆115 · Updated 2 weeks ago
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry ☆40 · Updated last year
- ☆68 · Updated 3 months ago
- ☆48 · Updated last year
- Repository for CPU Kernel Generation for LLM Inference ☆26 · Updated last year
- Work in progress. ☆56 · Updated 2 weeks ago
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving ☆304 · Updated 9 months ago
- Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees" adapted for Llama models ☆35 · Updated last year
- GPU operators for sparse tensor operations ☆32 · Updated last year
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆159 · Updated 9 months ago
- FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation ☆48 · Updated 9 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆72 · Updated 7 months ago
- The official implementation of the EMNLP 2023 paper LLM-FP4 ☆195 · Updated last year
- Bamboo-7B Large Language Model ☆92 · Updated last year
- ☆81 · Updated 3 weeks ago