Tencent / hpc-ops

High Performance LLM Inference Operator Library

☆695

Alternatives and similar repositories for hpc-ops

Users who are interested in hpc-ops are comparing it to the libraries listed below.

DD-DuDa / Cute-Learning
Examples of CUDA implementations using CUTLASS CuTe
☆270 Updated 7 months ago
xlite-dev / ffpa-attn
🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA.
☆250 Updated this week
66RING / tiny-flash-attention
Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS
☆484 Updated 2 weeks ago
OpenPPL / ppl.llm.kernel.cuda
☆152 Updated last year
microsoft / vattention
Dynamic Memory Management for Serving LLMs without PagedAttention
☆457 Updated 8 months ago
sgl-project / SpecForge
Train speculative decoding models effortlessly and port them smoothly to SGLang serving.
☆676 Updated this week
CalebDu / Awesome-Cute
☆113 Updated 8 months ago
KnowingNothing / MatmulTutorial
An easy-to-understand TensorOp Matmul tutorial
☆404 Updated this week
SiriusNEO / Triton-Puzzles-Lite
Puzzles for learning Triton, play it with minimal environment configuration!
☆613 Updated last month
flagos-ai / FlagCX
FlagCX is a scalable and adaptive cross-chip communication library.
☆172 Updated this week
infinigence / Semi-PD
A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation.
☆123 Updated last month
perplexityai / pplx-kernels
Perplexity GPU Kernels
☆554 Updated 3 months ago
LLMServe / SwiftTransformer
High performance Transformer implementation in C++.
☆150 Updated last year
stepfun-ai / StepMesh
☆342 Updated last week
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆219 Updated 2 weeks ago
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆148 Updated 8 months ago
flagos-ai / FlagGems
FlagGems is an operator library for large language models implemented in the Triton Language.
☆893 Updated this week
CalvinXKY / BasicCUDA
A tutorial for CUDA & PyTorch
☆227 Updated 2 weeks ago
AlibabaPAI / FLASHNN
☆105 Updated last year
reed-lau / cute-gemm
☆161 Updated 2 months ago
harleyszhang / llm_counts
LLM theoretical performance analysis tools, supporting parameter, FLOPs, memory, and latency analysis.
☆115 Updated 6 months ago
tile-ai / TileRT
Tile-Based Runtime for Ultra-Low-Latency LLM Inference
☆564 Updated 2 weeks ago
madsys-dev / deepseekv2-profile
☆155 Updated 11 months ago
pzhao-eng / FlashMLA
☆61 Updated 6 months ago
interestingLSY / swiftLLM
A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of …
☆313 Updated 7 months ago
InternLM / turbomind
☆96 Updated 10 months ago
Bruce-Lee-LY / flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆44 Updated 11 months ago
ColfaxResearch / cfx-article-src
☆175 Updated 9 months ago
galeselee / Awesome_LLM_System-PaperList
Since the emergence of ChatGPT in 2022, the acceleration of Large Language Models has become increasingly important. Here is a list of pap…
☆283 Updated 11 months ago
AyakaGEMM / Hands-on-GEMM
☆145 Updated last year