thuml / learn_torch.compile

torch.compile artifacts for common deep learning models; can be used as a learning resource for torch.compile

☆17

Alternatives and similar repositories for learn_torch.compile

Users who are interested in learn_torch.compile compare it to the libraries listed below

feifeibear / ChituAttention
Quantized Attention on GPU
☆44 Updated 7 months ago
microsoft / AttentionEngine
☆75 Updated last month
IBM / triton-dejavu
Framework to reduce autotune overhead to zero for well-known deployments.
☆79 Updated last week
tile-ai / AttentionEngine
☆49 Updated last month
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆69 Updated 2 months ago
Ascend / triton-ascend
Triton adapter for Ascend. Mirror of https://gitee.com/ascend/triton-ascend
☆59 Updated this week
li-plus / flash-preference
Accelerate LLM preference tuning via prefix sharing with a single line of code
☆42 Updated 2 weeks ago
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆91 Updated 2 weeks ago
flashinfer-ai / cutlass-viz
☆60 Updated 2 months ago
LeiWang1999 / AutoGPTQ.tvm
GPTQ inference TVM kernel
☆40 Updated last year
INT-FlashAttention2024 / INT-FlashAttention
☆77 Updated 5 months ago
tile-ai / TileOPs
☆42 Updated this week
feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆70 Updated last year
LeiWang1999 / Stream-k.tvm
☆19 Updated 9 months ago
megvii-research / IntLLaMA
IntLLaMA: A fast and light quantization solution for LLaMA
☆18 Updated last year
xdit-project / DiTCacheAnalysis
An auxiliary project analyzing the characteristics of KV in DiT attention.
☆31 Updated 7 months ago
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA and CuTe APIs to achieve peak⚡️ performance.
☆87 Updated 2 months ago
mit-han-lab / tinychat-tutorial
☆71 Updated 8 months ago
mit-han-lab / patch_conv
Patch convolution to avoid large GPU memory usage of Conv2D
☆90 Updated 5 months ago
flashinfer-ai / debug-print
Debug print operator for cudagraph debugging
☆12 Updated 11 months ago
TiledTensor / TiledBench
Benchmark tests supporting the TiledCUDA library.
☆16 Updated 7 months ago
sgl-project / DeepGEMM
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
☆17 Updated last month
PipeFusion / PipeFusion
A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters
☆47 Updated 11 months ago
pytorch-labs / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆184 Updated this week
sgl-project / tensorrt-demo
TensorRT LLM Benchmark Configuration
☆13 Updated 11 months ago
pigirons / conv3x3_m1
A demo of how to write a high-performance convolution that runs on Apple silicon
☆54 Updated 3 years ago
NoakLiu / FastCache-xDiT
FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation [Efficient ML Model]
☆29 Updated last month
xxyux / SpInfer
SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
☆50 Updated 3 months ago
Dao-AILab / gemm-cublas
☆21 Updated 2 months ago
Bruce-Lee-LY / decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA cores for the decoding stage of LLM inference.
☆38 Updated last month