chips-compilers-mlsys-21 / chips-compilers-mlsys-21.github.ioLinks

☆11

Alternatives and similar repositories for chips-compilers-mlsys-21.github.io

Users that are interested in chips-compilers-mlsys-21.github.io are comparing it to the libraries listed below

Sorting:

zhuzilin / pytorch-malloc
An external memory allocator example for PyTorch.
☆16Updated 5 months ago
LeiWang1999 / Stream-k.tvm
☆20Updated last year
microsoft / FractalTensor
FractalTensor is a programming framework that introduces a novel approach to organizing data in deep neural networks (DNNs) as a list of …
☆32Updated last year
LeiWang1999 / AutoGPTQ.tvm
GPTQ inference TVM kernel
☆40Updated last year
tlc-pack / cutlass_fpA_intB_gemm
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
☆96Updated 4 months ago
meta-pytorch / MSLK
MSLK (Meta Superintelligence Labs Kernels) is a collection of PyTorch GPU operator libraries that are designed and optimized for GenAI tr…
☆45Updated this week
jiazhihao / attention_superoptimizer
An Attention Superoptimizer
☆22Updated last year
yester31 / Cutlass_EX
study of cutlass
☆22Updated last year
awslabs / lorien
☆42Updated 2 years ago
tlc-pack / TLCBench
Benchmark scripts for TVM
☆74Updated 3 years ago
flagos-ai / libtriton_jit
A Triton JIT runtime and ffi provider in C++
☆31Updated last week
sjtu-epcc / Tacker
Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS
☆34Updated 11 months ago
pigirons / conv3x3_m1
This is a demo how to write a high performance convolution run on apple silicon
☆57Updated 3 years ago
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆106Updated 7 months ago
tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆113Updated last year
nox-410 / tvm.tl
An extention of TVMScript to write simple and high performance GPU kernels with tensorcore.
☆51Updated last year
facebookresearch / fairring
Fairring (FAIR + Herring) is a plug-in for PyTorch that provides a process group for distributed training that outperforms NCCL at large …
☆65Updated 3 years ago
thu-pacman / PET
PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections
☆124Updated 3 years ago
comaniac / epoi
Benchmark PyTorch Custom Operators
☆14Updated 2 years ago
bytedance / QSync
Official resporitory for "IPDPS' 24 QSync: Quantization-Minimized Synchronous Distributed Training Across Hybrid Devices".
☆20Updated last year
LeiWang1999 / tvm_gpu_gemm
play gemm with tvm
☆92Updated 2 years ago
lenLRX / AmpereSparseMatmul
study of Ampere' Sparse Matmul
☆18Updated 5 years ago
billmuch / matmul_perf_test
☆15Updated 3 years ago
TiledTensor / TiledLower
TiledLower is a Dataflow Analysis and Codegen Framework written in Rust.
☆14Updated last year
cmu-catalyst / collage
System for automated integration of deep learning backends.
☆47Updated 3 years ago
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆73Updated 9 months ago
UofT-EcoSystem / DietCode
DietCode Code Release
☆65Updated 3 years ago
TiledTensor / TiledBench
Benchmark tests supporting the TiledCUDA library.
☆18Updated last year
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆192Updated last year
tsinghua-ideal / Canvas
Canvas: End-to-End Kernel Architecture Search in Neural Networks
☆27Updated last year