OpenPPL/ppl.llm.serving

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/OpenPPL/ppl.llm.serving)

OpenPPL / ppl.llm.serving

☆128

Alternatives and similar repositories for ppl.llm.serving

Users that are interested in ppl.llm.serving are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

OpenPPL / ppl.pmx
View on GitHub
☆61Nov 21, 2024Updated last year
OpenPPL / ppl.nn.llm
View on GitHub
☆140Apr 23, 2024Updated 2 years ago
OpenPPL / ppl.llm.kernel.cuda
View on GitHub
☆150Jan 9, 2025Updated last year
OpenPPL / ppl.kernel.cuda
View on GitHub
☆38Oct 12, 2024Updated last year
OpenPPL / ppl.kernel.cpu
View on GitHub
☆19Apr 6, 2024Updated 2 years ago
Virtual machines for every use case on DigitalOcean • Ad
Get dependable uptime with 99.99% SLA, simple security tools, and predictable monthly pricing with DigitalOcean's virtual machines, called Droplets.
OpenPPL / ppl.nn
View on GitHub
A primitive library for neural network
☆1,367Nov 24, 2024Updated last year
OpenPPL / ppl.common
View on GitHub
Common libraries for PPL projects
☆31Mar 10, 2025Updated last year
OpenPPL / CuAssembler
View on GitHub
An unofficial cuda assembler, for all generations of SASS, hopefully ：）
☆85Mar 20, 2023Updated 3 years ago
OpenPPL / hpcc
View on GitHub
CMake configurations for PPL projects
☆12Aug 10, 2024Updated last year
tlc-pack / cutlass_fpA_intB_gemm
View on GitHub
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
☆96Jun 21, 2026Updated 3 weeks ago
Bruce-Lee-LY / flash_attention_inference
View on GitHub
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆45Feb 27, 2025Updated last year
KuangjuX / cu-x
View on GitHub
🎉My Collections of CUDA Kernels~
☆11Jun 25, 2024Updated 2 years ago
vancemiller / CUDA-preemption
View on GitHub
Experiments evaluating preemption on the NVIDIA Pascal architecture
☆16Nov 10, 2016Updated 9 years ago
JiangLiSJTU / token-ring
View on GitHub
☆13Jan 7, 2025Updated last year
Simple, predictable pricing with DigitalOcean hosting • Ad
Always know what you'll pay with monthly caps and flat pricing. Enterprise-grade infrastructure trusted by 600k+ customers.
OpenPPL / ppl.cv
View on GitHub
ppl.cv is a high-performance image processing library of openPPL supporting various platforms.
☆515Oct 30, 2024Updated last year
ModelTC / LightLLM
View on GitHub
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalabili…
☆4,183Updated this week
feifeibear / LLMRoofline
View on GitHub
Compare different hardware platforms via the Roofline Model for LLM inference tasks.
☆123Mar 13, 2024Updated 2 years ago
triple-mu / HunyuanDiT-TensorRT-libtorch
View on GitHub
HunyuanDiT with TensorRT and libtorch
☆18May 22, 2024Updated 2 years ago
alibaba / rtp-llm
View on GitHub
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
☆1,282Updated this week
IST-DASLab / marlin
View on GitHub
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
☆1,109Sep 4, 2024Updated last year
ankan-ban / llama_cu_awq
View on GitHub
llama INT4 cuda inference with AWQ
☆54Jan 20, 2025Updated last year
xlite-dev / Awesome-LLM-Inference
View on GitHub
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
☆5,404Jun 23, 2026Updated 3 weeks ago
yvonwin / qwen2.cpp
View on GitHub
qwen2 and llama3 cpp implementation
☆50Jun 7, 2024Updated 2 years ago
Managed Database hosting by DigitalOcean • Ad
PostgreSQL, MySQL, MongoDB, Kafka, Valkey, and OpenSearch available. Automatically scale up storage and focus on building your apps.
modelscope / dash-infer
View on GitHub
DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including …
☆273Aug 6, 2025Updated 11 months ago
void-main / fastertransformer_backend
View on GitHub
☆22Jul 11, 2023Updated 3 years ago
tlc-pack / libflash_attn
View on GitHub
Standalone Flash Attention v2 kernel without libtorch dependency
☆113Sep 10, 2024Updated last year
Rayrtfr / FasterTransformer
View on GitHub
Transformer related optimization, including BERT, GPT
☆17Jul 29, 2023Updated 2 years ago
efeslab / Nanoflow
View on GitHub
A throughput-oriented high-performance serving framework for LLMs
☆968Mar 29, 2026Updated 3 months ago
YellowOldOdd / SDBI
View on GitHub
Simple Dynamic Batching Inference
☆144Mar 8, 2022Updated 4 years ago
RulinShao / FastCkpt
View on GitHub
Python package for rematerialization-aware gradient checkpointing
☆27Oct 31, 2023Updated 2 years ago
mit-han-lab / omniserve
View on GitHub
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆850Mar 6, 2025Updated last year
pigirons / conv3x3_m1
View on GitHub
This is a demo how to write a high performance convolution run on apple silicon
☆56Feb 8, 2022Updated 4 years ago
Managed hosting for WordPress and PHP on Cloudways • Ad
Managed hosting for WordPress, Magento, Laravel, or PHP apps, on multiple cloud providers. Deploy in minutes on Cloudways by DigitalOcean.
deepspeedai / DeepSpeed-Kernels
View on GitHub
☆76Mar 26, 2025Updated last year
OpenPPL / ppq
View on GitHub
PPL Quantization Tool (PPQ) is a powerful offline neural network quantization tool.
☆1,806Mar 28, 2024Updated 2 years ago
KnowingNothing / MatmulTutorial
View on GitHub
A Easy-to-understand TensorOp Matmul Tutorial
☆445Mar 5, 2026Updated 4 months ago
parasailteam / coconet
View on GitHub
☆85Dec 2, 2022Updated 3 years ago
Dao-AILab / fast-hadamard-transform
View on GitHub
Fast Hadamard transform in CUDA, with a PyTorch interface
☆340Mar 10, 2026Updated 4 months ago
NVIDIA / TensorRT-LLM
View on GitHub
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizat…
☆14,162Updated this week
InternLM / lmdeploy
View on GitHub
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
☆7,965Updated this week