toy-ai-top / TritonLLM

LLM inference via Triton (flexible and modular), focused on kernel optimization using CUBIN binaries, starting from the gpt-oss model

☆44

Alternatives and similar repositories for TritonLLM

Users who are interested in TritonLLM are comparing it to the libraries listed below.

tile-ai / tilescale
Tile-based language built for AI computation across all scales
☆59 Updated last week
KuangjuX / NVSHMEM-Tutorial
NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer
☆128 Updated last week
infinigence / Semi-PD
A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation.
☆108 Updated 4 months ago
DeepLink-org / DLSlime
DLSlime: Flexible & Efficient Heterogeneous Transfer Toolkit
☆67 Updated this week
gty111 / gLLM
gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling
☆41 Updated last week
Ascend / triton-ascend
Triton adapter for Ascend. Mirror of https://gitee.com/ascend/triton-ascend
☆74 Updated this week
feifeibear / LLMRoofline
Compare different hardware platforms via the Roofline Model for LLM inference tasks.
☆113 Updated last year
FlagTree / libtriton_jit
A Triton JIT runtime and FFI provider in C++
☆22 Updated last week
FlagTree / flagtree
FlagTree is a unified compiler for multiple AI chips, which is forked from triton-lang/triton.
☆85 Updated last week
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆171 Updated last week
AlibabaPAI / FLASHNN
☆98 Updated last year
Chtholly-Boss / swizzle
A practical way of learning Swizzle
☆28 Updated 7 months ago
shenh10 / DeepSeek_Simulator
☆85 Updated 5 months ago
LeiWang1999 / Stream-k.tvm
☆19 Updated 11 months ago
xlite-dev / HGEMM
⚡️ Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak ⚡️ performance.
☆116 Updated 4 months ago
LeiWang1999 / tvm_gpu_gemm
Play with GEMM using TVM
☆91 Updated 2 years ago
PKU-SEC-Lab / HybriMoE
[DAC'25] Official implementation of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference"
☆71 Updated 3 months ago
tlc-pack / cutlass_fpA_intB_gemm
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
☆94 Updated last week
sunkx109 / GPUs-Specs
Summary of the specs of commonly used GPUs for training and inference of LLMs
☆63 Updated last month
HPMLL / NVIDIA-Hopper-Benchmark
☆57 Updated 3 months ago
microsoft / tokenweave
Efficient Compute-Communication Overlap for Distributed LLM Inference
☆43 Updated 2 weeks ago
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆185 Updated 7 months ago
microsoft / FractalTensor
FractalTensor is a programming framework that introduces a novel approach to organizing data in deep neural networks (DNNs) as a list of …
☆27 Updated 9 months ago
weishengying / cutlass_flash_atten_fp8
Implements FP8 flash attention on the Ada architecture using the CUTLASS library
☆75 Updated last year
tsinghua-ideal / Canvas
Canvas: End-to-End Kernel Architecture Search in Neural Networks
☆27 Updated 10 months ago
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆70 Updated 4 months ago
CalebDu / Awesome-Cute
☆104 Updated 4 months ago
sjtu-epcc / Tacker
Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS
☆31 Updated 7 months ago
heheda12345 / MagPy
☆39 Updated last year
xxyux / SpInfer
SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
☆57 Updated 6 months ago