haozixu / llama.cpp-npu
☆33 · Updated last month
Alternatives and similar repositories for llama.cpp-npu
Users interested in llama.cpp-npu are comparing it to the repositories listed below.
- Self-implemented NN operators for Qualcomm's Hexagon NPU ☆29 · Updated last month
- This repository is a read-only mirror of https://gitlab.arm.com/kleidi/kleidiai ☆99 · Updated this week
- Code for the ACM MobiCom 2024 paper "FlexNN: Efficient and Adaptive DNN Inference on Memory-Constrained Edge Devices" ☆56 · Updated 10 months ago
- Inference of RWKV v5, v6, and v7 with the Qualcomm AI Engine Direct SDK ☆87 · Updated last month
- A quantization algorithm for LLMs ☆146 · Updated last year
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs. ☆146 · Updated 3 months ago
- ☆170 · Updated 2 weeks ago
- [ACL 2024] A novel QAT framework with Self-Distillation to enhance ultra-low-bit LLMs. ☆129 · Updated last year
- ☆166 · Updated 2 years ago
- [ICML 2025] Official PyTorch implementation of "FlatQuant: Flatness Matters for LLM Quantization" ☆192 · Updated this week
- Code repository of "Evaluating Quantized Large Language Models" ☆137 · Updated last year
- ☆60 · Updated last year
- ☆76 · Updated last year
- ☆83 · Updated 10 months ago
- ☆125 · Updated 3 months ago
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving ☆331 · Updated last year
- This repository contains integer operators on GPUs for PyTorch. ☆223 · Updated 2 years ago
- BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment. ☆713 · Updated 3 months ago
- The original reference implementation of a llama.cpp backend specialized for the Qualcomm Hexagon NPU on Android phones, https://github.com/ggml… ☆35 · Updated 4 months ago
- High-speed and easy-to-use LLM serving framework for local deployment ☆137 · Updated 3 months ago
- An easy-to-use package for implementing SmoothQuant for LLMs ☆109 · Updated 7 months ago
- PyTorch extension for emulating FP8 data formats on standard FP32 Xeon/GPU hardware. ☆112 · Updated 11 months ago
- [ICML 2024] BiLLM: Pushing the Limit of Post-Training Quantization for LLMs ☆227 · Updated 10 months ago
- ☆125 · Updated last year
- [ICLR 2025] OSTQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitt… ☆83 · Updated 7 months ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5). ☆272 · Updated 4 months ago
- Reorder-based post-training quantization for large language models ☆197 · Updated 2 years ago
- An algorithm for weight-activation quantization (W4A4, W4A8) of LLMs, supporting both static and dynamic quantization ☆166 · Updated this week
- LLM inference in C/C++ ☆46 · Updated last week
- This repository contains the training code of ParetoQ, introduced in the work "ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization" ☆113 · Updated last month