OpenPPL / ppl.nn.llm

☆140

Related projects

Alternatives and complementary repositories for ppl.nn.llm

OpenPPL / ppl.llm.serving
☆123 Updated this week
OpenPPL / ppl.llm.kernel.cuda
☆136 Updated this week
OpenPPL / ppl.pmx
☆56 Updated this week
Bruce-Lee-LY / flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆26 Updated 2 months ago
AyakaGEMM / Hands-on-GEMM
☆97 Updated 7 months ago
AlibabaPAI / FLASHNN
☆79 Updated 2 months ago
void-main / FasterTransformer
Transformer-related optimization, including BERT and GPT
☆60 Updated last year
OpenPPL / ppl.kernel.cuda
☆32 Updated last month
FlagOpen / FlagAttention
A collection of memory efficient attention operators implemented in the Triton language.
☆215 Updated 5 months ago
tlc-pack / cutlass_fpA_intB_gemm
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
☆85 Updated 8 months ago
MARD1NO / CUDA-PPT
☆79 Updated last year
reed-lau / cute-gemm
☆78 Updated 8 months ago
weishengying / cutlass_flash_atten_fp8
An fp8 flash attention implementation for the Ada architecture, built with the cutlass repository
☆52 Updated 3 months ago
HandH1998 / QQQ
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
☆76 Updated last month
tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆98 Updated 2 months ago
bytedance / ByteMLPerf
AI Accelerator Benchmark focuses on evaluating AI Accelerators from a practical production perspective, including the ease of use and ver…
☆200 Updated last month
AniZpZ / AutoSmoothQuant
An easy-to-use package for implementing SmoothQuant for LLMs
☆82 Updated 5 months ago
bytedance / flux
A fast communication-overlapping library for tensor parallelism on GPUs.
☆219 Updated last week
microsoft / vattention
Dynamic Memory Management for Serving LLMs without PagedAttention
☆226 Updated this week
feifeibear / LLMRoofline
Compare different hardware platforms via the Roofline Model for LLM inference tasks.
☆74 Updated 7 months ago
FlagOpen / FlagGems
FlagGems is an operator library for large language models implemented in Triton Language.
☆329 Updated this week
LeiWang1999 / tvm_gpu_gemm
play gemm with tvm
☆84 Updated last year
Bruce-Lee-LY / cuda_hgemm
Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct…
☆296 Updated 2 months ago
KnowingNothing / MatmulTutorial
An Easy-to-understand TensorOp Matmul Tutorial
☆288 Updated last month
weishengying / tiny-flash-attention
A minimal flash-attention implementation using cutlass, intended for teaching purposes
☆32 Updated 3 months ago
ColfaxResearch / cutlass-kernels
☆162 Updated 4 months ago
66RING / tiny-flash-attention
flash attention tutorial written in python, triton, cuda, cutlass
☆195 Updated 4 months ago
usyd-fsalab / fp6_llm
Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5).
☆196 Updated 2 weeks ago
kwai / Megatron-Kwai
[USENIX ATC '24] Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Paral…
☆46 Updated 3 months ago
TiledTensor / TiledCUDA
TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.
☆148 Updated this week