haozixu / llama.cpp-npu
☆39 · Updated this week
Alternatives and similar repositories for llama.cpp-npu
Users interested in llama.cpp-npu are comparing it to the repositories listed below.
- Self-implemented NN operators for Qualcomm's Hexagon NPU ☆34 · Updated 2 months ago
- This repository is a read-only mirror of https://gitlab.arm.com/kleidi/kleidiai ☆106 · Updated last week
- Code for the ACM MobiCom 2024 paper "FlexNN: Efficient and Adaptive DNN Inference on Memory-Constrained Edge Devices" ☆57 · Updated 10 months ago
- Inference RWKV v5, v6 and v7 with the Qualcomm AI Engine Direct SDK ☆89 · Updated 2 weeks ago
- [ICML 2025] Official PyTorch implementation of "FlatQuant: Flatness Matters for LLM Quantization" ☆200 · Updated 3 weeks ago
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs. ☆151 · Updated 4 months ago
- A quantization algorithm for LLMs ☆146 · Updated last year
- ☆171 · Updated last week
- The original reference implementation of a llama.cpp backend for the Qualcomm Hexagon NPU on Android phones, https://github.com/ggml… ☆35 · Updated 5 months ago
- ☆60 · Updated last year
- BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment. ☆732 · Updated 4 months ago
- LLM theoretical performance analysis tools supporting parameter-count, FLOPs, memory, and latency analysis. ☆113 · Updated 5 months ago
- A simplified flash-attention implementation using CUTLASS, written for teaching purposes ☆51 · Updated last year
- 🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA. ☆239 · Updated last month
- Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5). ☆276 · Updated 5 months ago
- An easy-to-use package for implementing SmoothQuant for LLMs ☆110 · Updated 8 months ago
- High-speed GEMV kernels, with up to 2.7x speedup over the PyTorch baseline. ☆123 · Updated last year
- LLM inference in C/C++ ☆48 · Updated this week
- ☆83 · Updated 10 months ago
- ☆168 · Updated 2 years ago
- ☆141 · Updated last year
- [ACL 2024] A novel QAT with Self-Distillation framework to enhance ultra low-bit LLMs. ☆131 · Updated last year
- This repository contains integer operators on GPUs for PyTorch. ☆223 · Updated 2 years ago
- Code repository of "Evaluating Quantized Large Language Models" ☆137 · Updated last year
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer ☆96 · Updated 3 months ago
- Benchmark code for the "Online normalizer calculation for softmax" paper ☆103 · Updated 7 years ago
- mperf is an operator performance tuning toolbox for mobile/embedded platforms ☆192 · Updated 2 years ago
- [ICLR 2025] OSTQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitt… ☆86 · Updated 8 months ago
- ☆103 · Updated last year
- ☆125 · Updated 4 months ago
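One of the repositories above benchmarks the "Online normalizer calculation for softmax" paper. The core trick, fusing the separate max-finding and sum-of-exponentials passes into a single loop by rescaling the running normalizer whenever a new maximum appears, can be sketched in a few lines. This is a minimal illustration of the algorithm, not code taken from any of the listed repositories:

```python
import math

def online_softmax(xs):
    """Numerically stable softmax computed with a single pass over xs
    to obtain both the maximum and the normalizer."""
    m = float("-inf")  # running maximum seen so far
    d = 0.0            # running normalizer: sum of exp(x - m)
    for x in xs:
        m_new = max(m, x)
        # Rescale the old normalizer to the new maximum, then add the new term.
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in xs]

print(online_softmax([1.0, 3.0, 2.0, 0.5]))
```

The rescaling step is what makes the fused pass equivalent to the classical two-pass formulation, and the same idea underlies the streaming softmax used inside FlashAttention-style kernels.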