argonne-lcf / LLM-Inference-BenchLinks

LLM-Inference-Bench

☆57

Alternatives and similar repositories for LLM-Inference-Bench

Users that are interested in LLM-Inference-Bench are comparing it to the libraries listed below

Sorting:

AlibabaResearch / flash-llm
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
☆224Updated 2 years ago
hao-ai-lab / MuxServe
☆79Updated last month
DeepLink-org / DLSlime
DLSlime: Flexible & Efficient Heterogeneous Transfer Toolkit
☆82Updated this week
project-etalon / etalon
LLM Serving Performance Evaluation Harness
☆80Updated 9 months ago
microsoft / tokenweave
Efficient Compute-Communication Overlap for Distributed LLM Inference
☆62Updated last month
efeslab / fiddler
[ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration
☆243Updated last year
IBM / triton-dejavu
Framework to reduce autotune overhead to zero for well known deployments.
☆88Updated 2 months ago
flexflow / flexflow-serve
FlexFlow Serve: Low-Latency, High-Performance LLM Serving
☆62Updated 2 months ago
yinuotxie / Efficient-LLM-Inferencing-on-GPUs
Penn CIS 5650 (GPU Programming and Architecture) Final Project
☆44Updated last year
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆102Updated 5 months ago
Azure / msccl
Microsoft Collective Communication Library
☆66Updated last year
WukLab / preble
Stateful LLM Serving
☆89Updated 8 months ago
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆131Updated 6 months ago
meta-pytorch / KernelAgent
Autonomous GPU Kernel Generation via Deep Agents
☆163Updated last week
flagos-ai / FlagCX
☆128Updated last week
PKU-SEC-Lab / HybriMoE
[DAC'25] Official implement of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference"
☆90Updated 5 months ago
Infrawaves / DeepEP_ibrc_dual-ports_multiQP
Aims to implement dual-port and multi-qp solutions in deepEP ibrc transport
☆67Updated 6 months ago
INT-FlashAttention2024 / INT-FlashAttention
☆83Updated 10 months ago
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆188Updated last month
abhibambhaniya / GenZ-LLM-Analyzer
LLM Inference analyzer for different hardware platforms
☆96Updated 4 months ago
EfficientMoE / MoE-Infinity
PyTorch library for cost-effective, fast and easy serving of MoE models.
☆262Updated last month
abcdabcd987 / libfabric-efa-demo
☆71Updated 10 months ago
microsoft / chunk-attention
☆82Updated 7 months ago
usyd-fsalab / fp6_llm
An efficient GPU support for LLM inference with x-bit quantization (e.g. FP6,FP5).
☆272Updated 4 months ago
deepspeedai / DeepSpeed-Kernels
☆71Updated 8 months ago
AutonomicPerfectionist / PipeInfer
PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation
☆31Updated last year
flashinfer-ai / cutlass-viz
☆65Updated 7 months ago
ROCm / TransformerEngine
☆51Updated this week
thu-pacman / FasterMoE
☆88Updated 3 years ago
tyler-griggs / melange-release
☆48Updated last year