NVIDIA / nvshmem

NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process communication and coordination overheads by allowing programmers to perform one-sided communication from within CUDA kernels and on CUDA streams.

☆393

Alternatives and similar repositories for nvshmem

Users who are interested in nvshmem are comparing it to the libraries listed below.

ROCm / iris
AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming
☆119 Updated last week
perplexityai / pplx-kernels
Perplexity GPU Kernels
☆531 Updated 3 weeks ago
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆188 Updated last month
meta-pytorch / torchcomms
torchcomms: a modern PyTorch communications API
☆295 Updated this week
apache / tvm-ffi
Open ABI and FFI for Machine Learning Systems
☆211 Updated this week
yifuwang / symm-mem-recipes
☆147 Updated 11 months ago
microsoft / mscclpp
MSCCL++: A GPU-driven communication stack for scalable AI applications
☆437 Updated last week
KuangjuX / NVSHMEM-Tutorial
NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer
☆144 Updated 2 months ago
facebookexperimental / triton
GitHub mirror of the triton-lang/triton repo.
☆100 Updated this week
ColfaxResearch / cutlass-kernels
☆246 Updated last year
meta-pytorch / tritonparse
TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels
☆175 Updated last week
meta-pytorch / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆294 Updated this week
fzyzcjy / torch_memory_saver
Allow torch tensor memory to be released and resumed later
☆177 Updated this week
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆190 Updated 10 months ago
stepfun-ai / StepMesh
☆324 Updated 2 weeks ago
NVIDIA / tilus
Tilus is a tile-level kernel programming language with explicit control over shared memory and registers.
☆404 Updated last week
triton-lang / kernels
☆94 Updated last year
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆73 Updated 6 months ago
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆134 Updated 6 months ago
HazyResearch / Megakernels
kernels, of the mega variety
☆614 Updated 2 months ago
aikitoria / nanotrace
Low overhead tracing library and trace visualizer for pipelined CUDA kernels
☆112 Updated 2 weeks ago
Dao-AILab / quack
A Quirky Assortment of CuTe Kernels
☆675 Updated last week
ColfaxResearch / layout-categories
This repository contains companion software for the Colfax Research paper "Categorical Foundations for CuTe Layouts".
☆80 Updated 2 months ago
perplexityai / pplx-garden
Perplexity open source garden for inference technology
☆274 Updated last week
Deep-Learning-Profiling-Tools / triton-viz
☆256 Updated this week
ademeure / cuda-side-boost
☆51 Updated 6 months ago
ColfaxResearch / cfx-article-src
☆158 Updated 6 months ago
microsoft / vattention
Dynamic Memory Management for Serving LLMs without PagedAttention
☆444 Updated 6 months ago
triton-lang / triton-cpu
An experimental CPU backend for Triton
☆164 Updated 3 weeks ago
pranjalssh / fast.cu
Fastest kernels written from scratch
☆400 Updated 2 months ago