yinuotxie / Efficient-LLM-Inferencing-on-GPUsLinks

Penn CIS 5650 (GPU Programming and Architecture) Final Project

☆41

Alternatives and similar repositories for Efficient-LLM-Inferencing-on-GPUs

Users that are interested in Efficient-LLM-Inferencing-on-GPUs are comparing it to the libraries listed below

Sorting:

gty111 / GEMM_MMA
Optimize GEMM with tensorcore step by step
☆32Updated last year
ankan-ban / llama_cu_awq
llama INT4 cuda inference with AWQ
☆55Updated 8 months ago
AlibabaPAI / FLASHNN
☆99Updated last year
tlc-pack / cutlass_fpA_intB_gemm
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
☆94Updated 3 weeks ago
LeiWang1999 / tvm_gpu_gemm
play gemm with tvm
☆92Updated 2 years ago
Bruce-Lee-LY / cuda_hgemv
Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core.
☆67Updated last year
AlibabaResearch / flash-llm
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
☆221Updated 2 years ago
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆119Updated 5 months ago
lenLRX / AmpereSparseMatmul
study of Ampere' Sparse Matmul
☆18Updated 4 years ago
Bruce-Lee-LY / flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆40Updated 7 months ago
xxyux / SpInfer
SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
☆59Updated 6 months ago
weishengying / cutlass_flash_atten_fp8
使用 cutlass 仓库在 ada 架构上实现 fp8 的 flash attention
☆75Updated last year
microsoft / SparTA
☆152Updated last year
UDC-GAC / venom
A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores
☆53Updated last year
Bruce-Lee-LY / decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference.
☆44Updated 4 months ago
CalebDu / Awesome-Cute
☆106Updated 4 months ago
infinigence / Semi-PD
A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation.
☆111Updated 4 months ago
wangsiping97 / FastGEMV
High-speed GEMV kernels, at most 2.7x speedup compared to pytorch baseline.
☆116Updated last year
DD-DuDa / BitDecoding
A GPU-optimized system for efficient long-context LLMs decoding with low-bit KV cache.
☆60Updated this week
NVIDIA / online-softmax
Benchmark code for the "Online normalizer calculation for softmax" paper
☆101Updated 7 years ago
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆186Updated 8 months ago
MARD1NO / CUDA-PPT
☆109Updated 6 months ago
tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆111Updated last year
Ascend / triton-ascend
Triton adapter for Ascend. Mirror of https://gitee.com/ascend/triton-ascend
☆76Updated 2 weeks ago
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆179Updated 3 weeks ago
KuangjuX / NVSHMEM-Tutorial
NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer
☆135Updated 3 weeks ago
apuaaChen / EVT_AE
Artifacts of EVT ASPLOS'24
☆26Updated last year
abhibambhaniya / GenZ-LLM-Analyzer
LLM Inference analyzer for different hardware platforms
☆94Updated 3 months ago
ColfaxResearch / cfx-article-src
☆148Updated 5 months ago
rchardx / cuda-gemm
☆28Updated 6 months ago