toyaix / TritonLLMLinks

LLM Inference via Triton (Flexible & Modular): Focused on Kernel Optimization using CUBIN binaries, Starting from gpt-oss Model

☆61

Alternatives and similar repositories for TritonLLM

Users that are interested in TritonLLM are comparing it to the libraries listed below

Sorting:

OpenPPL / ppl.llm.kernel.cuda
☆152Updated 11 months ago
Ascend / triton-ascend
Triton adapter for Ascend. Mirror of https://gitee.com/ascend/triton-ascend
☆93Updated this week
caiwanxianhust / FasterLLaMA
使用 CUDA C++ 实现的 llama 模型推理框架
☆63Updated last year
flagos-ai / flagtree
FlagTree is a unified compiler supporting multiple AI chip backends for custom Deep Learning operations, which is forked from triton-lang…
☆146Updated this week
infinigence / Semi-PD
A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation.
☆119Updated 7 months ago
LeiWang1999 / tvm_gpu_gemm
play gemm with tvm
☆92Updated 2 years ago
AlibabaPAI / FLASHNN
☆103Updated last year
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆138Updated 7 months ago
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆190Updated 10 months ago
tile-ai / tilescale
Tile-based language built for AI computation across all scales
☆106Updated this week
CalebDu / Awesome-Cute
☆112Updated 7 months ago
tlc-pack / cutlass_fpA_intB_gemm
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
☆96Updated 3 months ago
feifeibear / LLMRoofline
Compare different hardware platforms via the Roofline Model for LLM inference tasks.
☆119Updated last year
KuangjuX / NVSHMEM-Tutorial
NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer
☆147Updated 3 months ago
Bruce-Lee-LY / flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆43Updated 9 months ago
gogongxt / nano-sglang
☆91Updated 3 weeks ago
DeepLink-org / DLSlime
DLSlime: Flexible & Efficient Heterogeneous Transfer Toolkit
☆84Updated last week
weishengying / cutlass_flash_atten_fp8
使用 cutlass 仓库在 ada 架构上实现 fp8 的 flash attention
☆78Updated last year
dsl-learn / cutile-learn
NVIDIA cuTile learn
☆130Updated 2 weeks ago
harleyszhang / llm_counts
llm theoretical performance analysis tools and support params, flops, memory and latency analysis.
☆113Updated 5 months ago
xlite-dev / ffpa-attn
🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA.
☆242Updated last month
ArthurinRUC / cutlass-notes
From Minimal GEMM to Everything
☆87Updated last month
yinuotxie / Efficient-LLM-Inferencing-on-GPUs
Penn CIS 5650 (GPU Programming and Architecture) Final Project
☆44Updated 2 years ago
mlc-ai / notebooks
☆216Updated last year
DeepLink-org / DLCompiler
triton for dsa
☆48Updated last week
ankan-ban / llama_cu_awq
llama INT4 cuda inference with AWQ
☆55Updated 11 months ago
luliyucoordinate / cute-flash-attention
Implement Flash Attention using Cute.
☆98Updated last year
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆200Updated 2 months ago
InternLM / turbomind
☆97Updated 8 months ago
OpenPPL / ppl.nn.llm
☆141Updated last year