hyperai / triton-cn

Triton documentation in Simplified Chinese

☆78

Alternatives and similar repositories for triton-cn

Users who are interested in triton-cn are comparing it to the repositories listed below.

harleyszhang / llm_counts
Theoretical performance analysis tools for LLMs, supporting parameter count, FLOPs, memory, and latency analysis.
☆99Updated 3 weeks ago
caiwanxianhust / FasterLLaMA
A llama model inference framework implemented in CUDA C++.
☆58Updated 8 months ago
xlite-dev / ffpa-attn
⚡️FFPA: Extend FlashAttention-2 with Split-D, achieve ~O(1) SRAM complexity for large headdim, 1.8x~3x↑ vs SDPA.🎉
☆194Updated 2 months ago
feifeibear / LLMRoofline
Compare different hardware platforms via the Roofline Model for LLM inference tasks.
☆110Updated last year
madsys-dev / deepseekv2-profile
☆145Updated 5 months ago
luliyucoordinate / cute-flash-attention
Implement Flash Attention using Cute.
☆92Updated 7 months ago
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆93Updated 2 months ago
InternLM / turbomind
☆92Updated 4 months ago
AlibabaPAI / FLASHNN
☆96Updated 10 months ago
InfiniTensor / InfiniTensor
☆246Updated this week
openmlsys / openmlsys-cuda
Tutorials for writing high-performance GPU operators in AI frameworks.
☆129Updated last year
infinigence / Semi-PD
A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation.
☆103Updated 2 months ago
AyakaGEMM / Hands-on-GEMM
☆137Updated last year
OpenPPL / ppl.llm.kernel.cuda
☆149Updated 6 months ago
PKU-SEC-Lab / HybriMoE
[DAC'25] Official implementation of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference"
☆64Updated last month
CalebDu / Awesome-Cute
☆89Updated 2 months ago
OpenPPL / ppl.llm.serving
☆128Updated 7 months ago
OpenPPL / ppl.nn.llm
☆139Updated last year
harleyszhang / lite_llama
A lightweight llama-like LLM inference framework based on Triton kernels.
☆144Updated last week
mdy666 / mdy_triton
☆140Updated last month
FlagOpen / FlagAttention
A collection of memory efficient attention operators implemented in the Triton language.
☆275Updated last year
YuxueYang1204 / CudaDemo
Implement custom operators in PyTorch with CUDA/C++
☆66Updated 2 years ago
pzhao-eng / FlashMLA
☆51Updated 2 weeks ago
weishengying / cutlass_flash_atten_fp8
Implements FP8 flash attention on the Ada architecture using the cutlass repository.
☆74Updated 11 months ago
FlagOpen / FlagCX
☆81Updated this week
Bruce-Lee-LY / decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA and MLA, using CUDA cores for the decoding stage of LLM inference.
☆40Updated last month
InternLM / Awesome-LLM-Training-System
☆42Updated 11 months ago
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆183Updated 6 months ago
Bruce-Lee-LY / flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆39Updated 5 months ago
DD-DuDa / Cute-Learning
Examples of CUDA implementations by Cutlass CuTe
☆214Updated last month