dlsyscourse / lecture14

☆9

Related projects:

habanero-lab / APPy
APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to…
☆20 Updated 5 months ago
LeiWang1999 / AutoGPTQ.tvm
GPTQ inference TVM kernel
☆35 Updated 4 months ago
tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆93 Updated last week
tlc-pack / cutlass_fpA_intB_gemm
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
☆82 Updated 6 months ago
feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆50 Updated 3 months ago
ankan-ban / llama_cu_awq
Llama INT4 CUDA inference with AWQ
☆46 Updated 2 months ago
microsoft / DeepSpeed-Kernels
☆50 Updated 3 months ago
Bruce-Lee-LY / flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆20 Updated last week
ColfaxResearch / cfx-article-src
☆17 Updated this week
Harry-Chen / InfMoE
Inference framework for MoE layers based on TensorRT with Python binding
☆41 Updated 3 years ago
masahi / torchscript-to-tvm
☆66 Updated last year
yester31 / Cutlass_EX
study of cutlass
☆18 Updated last year
AlibabaPAI / FLASHNN
☆67 Updated last week
wangsiping97 / FastGEMV
High-speed GEMV kernels, with up to 2.7x speedup over the PyTorch baseline.
☆81 Updated 2 months ago
uwsampl / SparseTIR
SparseTIR: Sparse Tensor Compiler for Deep Learning
☆129 Updated last year
NVIDIA / HMM_sample_code
CUDA 12.2 HMM demos
☆16 Updated last month
facebookresearch / fairring
Fairring (FAIR + Herring) is a plug-in for PyTorch that provides a process group for distributed training that outperforms NCCL at large …
☆61 Updated 2 years ago
tridao / cutlass_quant
☆16 Updated this week
nod-ai / transformer-benchmarks
benchmarking some transformer deployments
☆26 Updated last year
NVIDIA / online-softmax
Benchmark code for the "Online normalizer calculation for softmax" paper
☆52 Updated 6 years ago
openxla / openxla-nvgpu
☆48 Updated 6 months ago
tlc-pack / TLCBench
Benchmark scripts for TVM
☆73 Updated 2 years ago
MARD1NO / FxxkCUDA
☆52 Updated this week
facebookresearch / MODel_opt
Memory Optimizations for Deep Learning (ICML 2023)
☆58 Updated 6 months ago
NeuHub / TVMDeepDive
☆22 Updated 4 years ago
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆41 Updated 3 weeks ago
LeiWang1999 / tvm_gpu_gemm
play gemm with tvm
☆81 Updated last year
eedalong / Dpex
Distributed DataLoader for PyTorch Based on Ray
☆24 Updated 2 years ago
openmlsys / openmlsys-cuda
Tutorials for writing high-performance GPU operators in AI frameworks.
☆118 Updated last year
Bruce-Lee-LY / cuda_hgemv
Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.
☆40 Updated last week