feifeibear / LLMRooflineLinks

Compare different hardware platforms via the Roofline Model for LLM inference tasks.

☆110

Alternatives and similar repositories for LLMRoofline

Users that are interested in LLMRoofline are comparing it to the libraries listed below

Sorting:

InternLM / turbomind
☆92Updated 4 months ago
madsys-dev / deepseekv2-profile
☆145Updated 5 months ago
AlibabaPAI / FLASHNN
☆96Updated 10 months ago
infinigence / Semi-PD
A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation.
☆103Updated 2 months ago
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆154Updated last month
vllm-project / flash-attention
Fast and memory-efficient exact attention
☆82Updated last week
OpenPPL / ppl.llm.serving
☆128Updated 7 months ago
LLMServe / SwiftTransformer
High performance Transformer implementation in C++.
☆128Updated 6 months ago
flashinfer-ai / cutlass-viz
☆60Updated 3 months ago
microsoft / chunk-attention
☆78Updated 3 months ago
kwai / Megatron-Kwai
[USENIX ATC '24] Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Paral…
☆61Updated last year
stepfun-ai / StepMesh
☆209Updated this week
OpenPPL / ppl.llm.kernel.cuda
☆149Updated 6 months ago
tlc-pack / cutlass_fpA_intB_gemm
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
☆94Updated 3 weeks ago
thunlp / Seq1F1B
Sequence-level 1F1B schedule for LLMs.
☆29Updated last month
CalvinXKY / mfu_calculation
A simple calculation for LLM MFU.
☆42Updated 5 months ago
InternLM / Awesome-LLM-Training-System
☆42Updated 11 months ago
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆93Updated 2 months ago
CalebDu / Awesome-Cute
☆89Updated 2 months ago
HandH1998 / QQQ
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
☆137Updated 3 months ago
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆71Updated 2 months ago
sunkx109 / GPUs-Specs
Summary of the Specs of Commonly Used GPUs for Training and Inference of LLM
☆55Updated this week
WukLab / preble
Stateful LLM Serving
☆77Updated 4 months ago
Bruce-Lee-LY / decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference.
☆40Updated last month
LoongServe / LoongServe
☆109Updated 8 months ago
AlibabaResearch / flash-llm
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
☆216Updated last year
OpenPPL / ppl.nn.llm
☆139Updated last year
Ascend / triton-ascend
Triton adapter for Ascend. Mirror of https://gitee.com/ascend/triton-ascend
☆61Updated last week
alibaba / easydist
Automated Parallelization System and Infrastructure for Multiple Ecosystems
☆79Updated 8 months ago
Bruce-Lee-LY / flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆39Updated 5 months ago