intel / tiny-dpcpp-nn
SYCL implementation of Fused MLPs for Intel GPUs
☆47 Updated 2 months ago
Alternatives and similar repositories for tiny-dpcpp-nn
Users who are interested in tiny-dpcpp-nn are comparing it to the libraries listed below.
- Implementation of a methodology that allows all sorts of user defined GPU kernel fusion, for non CUDA programmers. ☆16 Updated this week
- High-Performance SGEMM on CUDA devices ☆97 Updated 7 months ago
- ☆32 Updated last year
- LLM training in simple, raw C/CUDA ☆104 Updated last year
- Efficient implementation of DeepSeek Ops (Blockwise FP8 GEMM, MoE, and MLA) for AMD Instinct MI300X ☆67 Updated 3 weeks ago
- ☆53 Updated this week
- NVIDIA HPCG is based on the HPCG benchmark and optimized for performance on NVIDIA accelerated HPC systems. ☆61 Updated last week
- ☆237 Updated 2 months ago
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard! ☆74 Updated this week
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆95 Updated 2 months ago
- A tool for generating information about the matrix multiplication instructions in AMD Radeon™ and AMD Instinct™ accelerators ☆112 Updated 3 months ago
- A parallel framework for training deep neural networks ☆63 Updated 5 months ago
- Test suite for probing the numerical behavior of NVIDIA tensor cores ☆40 Updated last year
- JaxPP is a library for JAX that enables flexible MPMD pipeline parallelism for large-scale LLM training ☆52 Updated 2 weeks ago
- This repository contains the experimental PyTorch native float8 training UX ☆224 Updated last year
- ☆163 Updated last year
- ☆74 Updated 8 months ago
- Patch convolution to avoid large GPU memory usage of Conv2D ☆92 Updated 7 months ago
- A faster implementation of OpenCV-CUDA that uses OpenCV objects, and more! ☆51 Updated last month
- Super fast FP32 matrix multiplication on RDNA3 ☆71 Updated 4 months ago
- A block oriented training approach for inference time optimization. ☆34 Updated last year
- TritonParse: A Compiler Tracer, Visualizer, and mini-Reproducer (WIP) for Triton Kernels ☆144 Updated this week
- FlashRNN - Fast RNN Kernels with I/O Awareness ☆96 Updated 2 months ago
- extensible collectives library in triton ☆88 Updated 4 months ago
- ☆16 Updated 11 months ago
- Ahead of Time (AOT) Triton Math Library ☆75 Updated this week
- An extension library of WMMA API (Tensor Core API) ☆103 Updated last year
- ☆88 Updated last year
- A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser") ☆349 Updated this week
- ☆41 Updated last week