TileLang / tvm

Open deep learning compiler stack for CPUs, GPUs, and specialized accelerators

☆15

Related projects

Alternatives and complementary repositories for tvm

sgl-project / tensorrt-demo
TensorRT LLM Benchmark Configuration
☆11 Updated 3 months ago
LeiWang1999 / AutoGPTQ.tvm
GPTQ inference TVM kernel
☆36 Updated 6 months ago
Bruce-Lee-LY / decoding_attention
Decoding Attention is specially optimized for multi-head attention (MHA), using CUDA cores for the decoding stage of LLM inference.
☆25 Updated 2 weeks ago
caiwanxianhust / FasterLLaMA
A LLaMA model inference framework implemented in CUDA C++
☆24 Updated 2 weeks ago
feifeibear / ChituAttention
Quantized Attention on GPU
☆30 Updated 2 weeks ago
CalebDu / Awesome-Cute
☆16 Updated this week
LeiWang1999 / Stream-k.tvm
☆18 Updated last month
tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆98 Updated 2 months ago
feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆57 Updated 5 months ago
mlc-ai / xgrammar
Efficient, Flexible and Portable Structured Generation
☆53 Updated this week
AlibabaPAI / FLASHNN
☆79 Updated 2 months ago
weishengying / cutlass_flash_atten_fp8
FP8 flash attention implemented on the Ada architecture using the CUTLASS library
☆52 Updated 3 months ago
INT-FlashAttention2024 / INT-FlashAttention
☆47 Updated 2 months ago
mit-han-lab / tinychat-tutorial
☆52 Updated 2 weeks ago
tlc-pack / cutlass_fpA_intB_gemm
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
☆85 Updated 8 months ago
GindaChen / FlexFlashAttention3
FlexAttention w/ FlashAttention3 Support
☆27 Updated last month
InternLM / turbomind
☆35 Updated 2 weeks ago
habanero-lab / APPy
APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to…
☆20 Updated last week
ankan-ban / llama_cu_awq
LLaMA INT4 CUDA inference with AWQ
☆48 Updated 4 months ago
TiledTensor / TiledCUDA
TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.
☆156 Updated this week
chengzeyi / piflux
(WIP) Parallel inference for black-forest-labs' FLUX model.
☆11 Updated this week
rayleizhu / vllm-ra
[ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts
☆34 Updated 8 months ago
pigirons / conv3x3_m1
A demo of how to write a high-performance convolution that runs on Apple silicon
☆52 Updated 2 years ago
feifeibear / LLMRoofline
Compare different hardware platforms via the Roofline Model for LLM inference tasks.
☆75 Updated 8 months ago
enp1s0 / ozIMMU
FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme
☆46 Updated 2 months ago
vllm-project / flash-attention
Fast and memory-efficient exact attention
☆30 Updated 3 weeks ago
ModelTC / awesome-lm-system
Summary of system papers, frameworks, code, and tools for training or serving large models
☆56 Updated 11 months ago
InfiniTensor / RefactorGraph
A layered, decoupled deep learning inference engine
☆60 Updated 2 months ago