intel / tiny-dpcpp-nn
SYCL implementation of Fused MLPs for Intel GPUs
☆43 Updated 3 weeks ago
Related projects
Alternatives and complementary repositories for tiny-dpcpp-nn
- ☆32 Updated 5 months ago
- ☆30 Updated this week
- An experimental CPU backend for Triton (https://github.com/openai/triton) ☆35 Updated 6 months ago
- Learning about CUDA by writing PTX code. ☆28 Updated 8 months ago
- GPUOcelot: A dynamic compilation framework for PTX ☆147 Updated last month
- LLM training in simple, raw C/CUDA ☆86 Updated 6 months ago
- NVIDIA HPCG is based on the HPCG benchmark and optimized for performance on NVIDIA accelerated HPC systems. ☆44 Updated last month
- Patch convolution to avoid large GPU memory usage of Conv2D ☆79 Updated 5 months ago
- FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme ☆46 Updated 2 months ago
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆187 Updated this week
- Attention in SRAM on Tenstorrent Grayskull ☆29 Updated 4 months ago
- A block oriented training approach for inference time optimization. ☆30 Updated 3 months ago
- ☆39 Updated 2 months ago
- rocWMMA ☆92 Updated this week
- A tool for generating information about the matrix multiplication instructions in AMD Radeon™ and AMD Instinct™ accelerators ☆68 Updated 10 months ago
- Fast and memory efficient PyTorch implementation of the Perceiver with FlashAttention. ☆20 Updated 2 weeks ago
- ☆153 Updated this week
- ☆44 Updated last week
- ☆14 Updated last month
- CuPBoP-AMD is a CUDA translator that translates CUDA programs at NVVM IR level to HIP-compatible IR that can run on AMD GPUs. ☆33 Updated last year
- This repository contains the experimental PyTorch native float8 training UX ☆211 Updated 3 months ago
- Simple and fast low-bit matmul kernels in CUDA / Triton ☆145 Updated this week
- A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser") ☆271 Updated this week
- An extension library of WMMA API (Tensor Core API) ☆84 Updated 4 months ago
- ☆133 Updated 9 months ago
- extensible collectives library in triton ☆72 Updated last month
- Unified compiler/runtime for interfacing with PyTorch Dynamo. ☆95 Updated this week
- OpenAI Triton backend for Intel® GPUs ☆143 Updated this week
- ☆13 Updated this week
- IREE's PyTorch Frontend, based on Torch Dynamo. ☆55 Updated this week