OpenBMB / CPM.cuLinks

CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge techniques in sparse architecture, speculative sampling and quantization.

☆225

Alternatives and similar repositories for CPM.cu

Users that are interested in CPM.cu are comparing it to the libraries listed below

Sorting:

zhaochenyang20 / ModelServer
Efficient, Flexible, and Highly Fault-Tolerant Model Service Management Based on SGLang
☆61Updated last year
InternLM / turbomind
☆96Updated 10 months ago
HandH1998 / QQQ
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
☆154Updated 5 months ago
madsys-dev / deepseekv2-profile
☆155Updated 11 months ago
inferflow / inferflow
Inferflow is an efficient and highly configurable inference engine for large language models (LLMs).
☆251Updated last year
Tencent / AngelSlim
Model compression toolkit engineered for enhanced usability, comprehensiveness, and efficiency.
☆314Updated this week
xlite-dev / ffpa-attn
🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA.
☆250Updated this week
IsaacRe / vllm-kvcompress
KV cache compression for high-throughput LLM inference
☆153Updated last year
modelscope / dash-infer
DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including …
☆274Updated 6 months ago
hyperai / triton-cn
Triton Documentation in Chinese Simplified / Triton 中文文档
☆102Updated last month
AniZpZ / AutoSmoothQuant
An easy-to-use package for implementing SmoothQuant for LLMs
☆110Updated 10 months ago
ByteDance-Seed / decoupleQ
A quantization algorithm for LLM
☆148Updated last year
neuralmagic / AutoFP8
☆206Updated 9 months ago
casper-hansen / AutoAWQ_kernels
☆79Updated last year
feifeibear / LLMRoofline
Compare different hardware platforms via the Roofline Model for LLM inference tasks.
☆120Updated last year
flagos-ai / FlagCX
FlagCX is a scalable and adaptive cross-chip communication library.
☆172Updated this week
microsoft / AttentionEngine
☆118Updated 8 months ago
vllm-project / flash-attention
Fast and memory-efficient exact attention
☆111Updated last week
sihyeong / Awesome-LLM-Inference-Engine
☆166Updated 2 months ago
OpenGVLab / EfficientQAT
[ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
☆327Updated 2 months ago
vllm-project / compressed-tensors
A safetensors extension to efficiently store sparse quantized tensors on disk
☆238Updated this week
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆141Updated last year
efeslab / Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆336Updated last year
wejoncy / QLLM
A general 2-8 bits quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, and export to onnx/onnx-runtime easily.
☆184Updated 10 months ago
Tencent / hpc-ops
High Performance LLM Inference Operator Library
☆695Updated this week
Bruce-Lee-LY / decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference.
☆46Updated 7 months ago
NVIDIA / Star-Attention
Efficient LLM Inference over Long Sequences
☆394Updated 7 months ago
NickL77 / BaldEagle
3x Faster Inference; Unofficial implementation of EAGLE Speculative Decoding
☆83Updated 7 months ago
JT-Ushio / MHA2MLA
Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs
☆204Updated 2 months ago
PKU-SEC-Lab / HybriMoE
[DAC'25] Official implement of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference"
☆101Updated last month