ROCm / AITemplate
AITemplate is a Python framework that renders neural networks into high-performance CUDA/HIP C++ code, specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.
☆12 · Updated last year
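The core idea the blurb above describes (lowering a network definition to specialized, shape-templated C++ source) can be illustrated with a toy sketch. This is plain Python with no AITemplate dependency; the function names and the generated C++ snippet are purely illustrative assumptions, not AITemplate's real API or output.

```python
# Toy illustration of ahead-of-time source generation for inference kernels.
# It mimics the *idea* behind AITemplate (render a graph to specialized C++
# code with shapes baked in as template parameters); it is NOT the real API.

def codegen_gemm(name: str, m: int, n: int, k: int) -> str:
    """Render one FP16 GEMM launch, with shapes baked in, as C++ source text."""
    return (
        f"// {name}: [{m}x{k}] * [{k}x{n}] -> [{m}x{n}], FP16\n"
        f"launch_gemm_fp16<{m}, {n}, {k}>(A, B, C, stream);\n"
    )

def codegen_network(layers) -> str:
    """Lower a list of (name, m, n, k) layer shapes into one C++ function body."""
    body = "".join(codegen_gemm(*layer) for layer in layers)
    return "void run_network(...) {\n" + body + "}\n"

if __name__ == "__main__":
    # Two fully-connected layers of a hypothetical network.
    print(codegen_network([("fc1", 1, 512, 768), ("fc2", 1, 128, 512)]))
```

Because every shape is a compile-time constant in the emitted source, the downstream C++ compiler can unroll and specialize aggressively; that is the performance argument for this style of framework.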
Alternatives and similar repositories for AITemplate
Users interested in AITemplate are comparing it to the libraries listed below.
- Fast and memory-efficient exact attention ☆202 · Updated this week
- Development repository for the Triton language and compiler ☆137 · Updated last week
- Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators ☆492 · Updated this week
- AI Tensor Engine for ROCm ☆311 · Updated this week
- ☆27 · Updated 2 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆111 · Updated this week
- OpenAI Triton backend for Intel® GPUs ☆222 · Updated this week
- ☆51 · Updated last week
- A simple Flash Attention v2 implementation with ROCm (RDNA3 GPU, rocWMMA), mainly used for Stable Diffusion (ComfyUI) in Windows ZLUDA en… ☆48 · Updated last year
- ☆248 · Updated last year
- Ahead-of-Time (AOT) Triton Math Library ☆84 · Updated 3 weeks ago
- ☆67 · Updated this week
- Collection of benchmarks to measure basic GPU capabilities ☆468 · Updated last month
- SYCL* Templates for Linear Algebra (SYCL*TLA): a SYCL-based CUTLASS implementation for Intel GPUs ☆58 · Updated last week
- ROCm Communication Collectives Library (RCCL) ☆404 · Updated this week
- Intel® Tensor Processing Primitives extension for PyTorch* ☆17 · Updated last week
- ☆159 · Updated 7 months ago
- An easy-to-understand TensorOp Matmul tutorial ☆394 · Updated 2 months ago
- Model Compression Toolbox for Large Language Models and Diffusion Models ☆706 · Updated 3 months ago
- QuickReduce is a performant all-reduce library designed for AMD ROCm that supports inline compression ☆36 · Updated 3 months ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5) ☆272 · Updated 4 months ago
- GitHub mirror of the triton-lang/triton repo ☆104 · Updated this week
- AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming ☆121 · Updated this week
- A PyTorch extension: tools for easy mixed-precision and distributed training in PyTorch ☆24 · Updated this week
- cudnn_frontend provides a C++ wrapper for the cuDNN backend API and samples showing how to use it ☆652 · Updated last week
- Intel® Extension for DeepSpeed* is an extension to DeepSpeed that brings feature support with SYCL kernels on Intel GPU (XPU) devices. Note… ☆63 · Updated 5 months ago
- BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment ☆723 · Updated 4 months ago
- [DEPRECATED] Moved to the ROCm/rocm-libraries repo ☆114 · Updated this week
- Experimental projects related to TensorRT ☆116 · Updated last week
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆214 · Updated last week