intel / tiny-dpcpp-nn
SYCL implementation of Fused MLPs for Intel GPUs
☆46 Updated 3 months ago
Alternatives and similar repositories for tiny-dpcpp-nn:
Users who are interested in tiny-dpcpp-nn are comparing it to the libraries listed below.
- High-Performance SGEMM on CUDA devices ☆76 Updated last month
- Learning about CUDA by writing PTX code. ☆35 Updated 11 months ago
- Patch convolution to avoid large GPU memory usage of Conv2D ☆85 Updated 3 weeks ago
- LLM training in simple, raw C/CUDA ☆91 Updated 9 months ago
- A tool for generating information about the matrix multiplication instructions in AMD Radeon™ and AMD Instinct™ accelerators ☆74 Updated last year
- Python Module for PyTorch Tensor Visualisation in CUDA Eliminating CPU Transfer ☆37 Updated 3 weeks ago
- This repository contains the experimental PyTorch native float8 training UX ☆221 Updated 6 months ago
- NVIDIA HPCG is based on the HPCG benchmark and optimized for performance on NVIDIA accelerated HPC systems. ☆49 Updated 4 months ago
- A block oriented training approach for inference time optimization. ☆32 Updated 6 months ago
- ☆34 Updated this week
- An extension library of WMMA API (Tensor Core API) ☆88 Updated 7 months ago
- FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme ☆55 Updated this week
- ☆15 Updated 4 months ago
- A hands-on introduction to tuning GPU kernels using Kernel Tuner https://github.com/KernelTuner/kernel_tuner/ ☆30 Updated 5 months ago
- ☆32 Updated 8 months ago
- Unified compiler/runtime for interfacing with PyTorch Dynamo. ☆100 Updated this week
- Attention in SRAM on Tenstorrent Grayskull ☆31 Updated 7 months ago
- ☆59 Updated last month
- Step-by-step optimization of CUDA SGEMM ☆285 Updated 2 years ago
- ☆181 Updated 7 months ago
- An experimental CPU backend for Triton (https://github.com/openai/triton) ☆38 Updated 9 months ago
- FlashRNN - Fast RNN Kernels with I/O Awareness ☆75 Updated 2 months ago
- ☆86 Updated 11 months ago
- Test suite for probing the numerical behavior of NVIDIA tensor cores ☆37 Updated 6 months ago
- ☆43 Updated last week
- High-speed GEMV kernels, at most 2.7x speedup compared to pytorch baseline. ☆97 Updated 7 months ago
- End to End steps for adding custom ops in PyTorch. ☆20 Updated 4 years ago
- ☆19 Updated 3 months ago
- ☆142 Updated last year
- Fastest kernels written from scratch ☆173 Updated last week