haozixu / llama.cpp-npu
☆39 · Updated this week
Alternatives and similar repositories for llama.cpp-npu
Users interested in llama.cpp-npu are comparing it to the repositories listed below.
- Self-implemented NN operators for Qualcomm's Hexagon NPU ☆34 · Updated 2 months ago
- This repository is a read-only mirror of https://gitlab.arm.com/kleidi/kleidiai ☆106 · Updated last week
- Code for the ACM MobiCom 2024 paper "FlexNN: Efficient and Adaptive DNN Inference on Memory-Constrained Edge Devices" ☆57 · Updated 10 months ago
- Inference RWKV v5, v6 and v7 with the Qualcomm AI Engine Direct SDK ☆89 · Updated 2 weeks ago
- [ICML 2025] Official PyTorch implementation of "FlatQuant: Flatness Matters for LLM Quantization" ☆200 · Updated 3 weeks ago
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs. ☆151 · Updated 4 months ago
- A quantization algorithm for LLMs ☆146 · Updated last year
- ☆171 · Updated last week
- The original reference implementation of a llama.cpp backend for the Qualcomm Hexagon NPU on Android phones, https://github.com/ggml… ☆35 · Updated 5 months ago
- ☆60 · Updated last year
- BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment. ☆732 · Updated 4 months ago
- LLM theoretical performance analysis tools supporting parameter-count, FLOPs, memory, and latency analysis. ☆113 · Updated 5 months ago
- A simplified flash-attention implementation using CUTLASS, written for teaching purposes ☆51 · Updated last year
- 🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA. ☆239 · Updated last month
- Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5). ☆276 · Updated 5 months ago
- An easy-to-use package for implementing SmoothQuant for LLMs ☆110 · Updated 8 months ago
- High-speed GEMV kernels, with up to 2.7x speedup over the PyTorch baseline. ☆123 · Updated last year
- LLM inference in C/C++ ☆48 · Updated this week
- ☆83 · Updated 10 months ago
- ☆168 · Updated 2 years ago
- ☆141 · Updated last year
- [ACL 2024] A novel QAT with Self-Distillation framework to enhance ultra low-bit LLMs. ☆131 · Updated last year
- This repository contains integer operators on GPUs for PyTorch. ☆223 · Updated 2 years ago
- Code repository of "Evaluating Quantized Large Language Models" ☆137 · Updated last year
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer ☆96 · Updated 3 months ago
- Benchmark code for the "Online normalizer calculation for softmax" paper ☆103 · Updated 7 years ago
- mperf is an operator performance tuning toolbox for mobile/embedded platforms ☆192 · Updated 2 years ago
- [ICLR 2025] OSTQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitt… ☆86 · Updated 8 months ago
- ☆103 · Updated last year
- ☆125 · Updated 4 months ago
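One of the repositories above benchmarks the "Online normalizer calculation for softmax" paper. The core trick, fusing the separate max-finding and sum-of-exponentials passes into a single loop by rescaling the running normalizer whenever a new maximum appears, can be sketched in a few lines. This is a minimal illustration of the algorithm, not code taken from any of the listed repositories:

```python
import math

def online_softmax(xs):
    """Numerically stable softmax computed with a single pass over xs
    to obtain both the maximum and the normalizer."""
    m = float("-inf")  # running maximum seen so far
    d = 0.0            # running normalizer: sum of exp(x - m)
    for x in xs:
        m_new = max(m, x)
        # Rescale the old normalizer to the new maximum, then add the new term.
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in xs]

print(online_softmax([1.0, 3.0, 2.0, 0.5]))
```

The rescaling step is what makes the fused pass equivalent to the classical two-pass formulation, and the same idea underlies the streaming softmax used inside FlashAttention-style kernels.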