intel / tiny-dpcpp-nn

SYCL implementation of Fused MLPs for Intel GPUs

☆47

Alternatives and similar repositories for tiny-dpcpp-nn

Users who are interested in tiny-dpcpp-nn are comparing it to the libraries listed below.

salykova / sgemm.cu
High-Performance SGEMM on CUDA devices
☆95 Updated 5 months ago
NVIDIA / jaxpp
JaxPP is a library for JAX that enables flexible MPMD pipeline parallelism for large-scale LLM training
☆49 Updated last month
gpu-mode / reference-kernels
Reference Kernels for the Leaderboard
☆60 Updated last week
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆90 Updated 2 weeks ago
pytorch-labs / tritonparse
TritonParse is a tool designed to help developers analyze and debug Triton kernels by visualizing the compilation process and source code…
☆93 Updated last week
codeplaysoftware / cutlass-sycl
A CUTLASS implementation using SYCL
☆27 Updated this week
RadeonFlow / RadeonFlow_Kernels
Efficient implementation of DeepSeek Ops (Blockwise FP8 GEMM, MoE, and MLA) for AMD Instinct MI300X
☆50 Updated last week
mit-han-lab / patch_conv
Patch convolution to avoid large GPU memory usage of Conv2D
☆88 Updated 5 months ago
intel / torch-xpu-ops
☆46 Updated this week
gevtushenko / llm.c
LLM training in simple, raw C/CUDA
☆99 Updated last year
manishucsd / py-codegen
☆16 Updated 9 months ago
pytorch-labs / float8_experimental
This repository contains the experimental PyTorch native float8 training UX
☆224 Updated 10 months ago
pytorch-labs / superblock
A block oriented training approach for inference time optimization.
☆33 Updated 10 months ago
vdesai2014 / inference-optimization-blog-post
☆88 Updated last year
cchan / tccl
extensible collectives library in triton
☆86 Updated 2 months ago
HanGuo97 / log-linear-attention
☆216 Updated 3 weeks ago
sandeepkumar-skb / pytorch_custom_op
End to End steps for adding custom ops in PyTorch.
☆23 Updated 4 years ago
Dao-AILab / quack
A Quirky Assortment of CuTe Kernels
☆117 Updated this week
ROCm / aotriton
Ahead of Time (AOT) Triton Math Library
☆66 Updated last week
iree-org / iree-nvgpu
☆50 Updated last year
intel / intel-extension-for-openxla
☆47 Updated 3 weeks ago
pytorch-labs / triton-cpu
An experimental CPU backend for Triton (https://github.com/openai/triton)
☆43 Updated 3 months ago
hyhieu / easy_pybind
☆32 Updated last year
pytorch-labs / helion
A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.
☆166 Updated this week
wmmae / wmma_extension
An extension library of WMMA API (Tensor Core API)
☆99 Updated 11 months ago
pytorch-labs / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆167 Updated this week
ROCm / amd_matrix_instruction_calculator
A tool for generating information about the matrix multiplication instructions in AMD Radeon™ and AMD Instinct™ accelerators
☆98 Updated last month
wangsiping97 / FastGEMV
High-speed GEMV kernels, at most 2.7x speedup compared to pytorch baseline.
☆109 Updated 11 months ago
NX-AI / flashrnn
FlashRNN - Fast RNN Kernels with I/O Awareness
☆91 Updated 2 weeks ago
microsoft / AttentionEngine
☆71 Updated last month