NVIDIA / nvshmem
NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process communication and coordination overheads by allowing programmers to perform one-sided communication from within CUDA kernels and on CUDA streams.
☆461 · Updated last month
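As a minimal sketch of the one-sided model the description refers to: each PE (processing element) can write directly into a peer's symmetric memory from inside a CUDA kernel, with no matching receive on the target. This example is an illustration only (it assumes an NVSHMEM installation, one GPU per PE, and a launcher such as `nvshmrun` or an MPI bootstrap), not code from the repository:

```cuda
#include <nvshmem.h>
#include <nvshmemx.h>

// Each PE writes its own ID into a symmetric buffer on the next PE in a ring.
__global__ void ring_put(int *dst) {
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (mype + 1) % npes;
    // One-sided put issued from device code: no host round-trip, no recv.
    nvshmem_int_p(dst, mype, peer);
}

int main() {
    nvshmem_init();
    // Symmetric allocation: every PE gets a buffer at the same symmetric address.
    int *dst = (int *)nvshmem_malloc(sizeof(int));

    ring_put<<<1, 1>>>(dst);
    nvshmem_barrier_all();  // completes the puts and synchronizes all PEs

    nvshmem_free(dst);
    nvshmem_finalize();
    return 0;
}
```

The same put can also be issued per-thread, per-warp, or per-block, and stream-ordered variants (`nvshmemx_*_on_stream`) cover the "on CUDA streams" case mentioned above.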
Alternatives and similar repositories for nvshmem
Users interested in nvshmem are comparing it to the libraries listed below.
- torchcomms: a modern PyTorch communications API ☆323 · Updated this week
- Perplexity GPU Kernels ☆554 · Updated 2 months ago
- A lightweight design for computation-communication overlap. ☆213 · Updated last week
- MSCCL++: A GPU-driven communication stack for scalable AI applications ☆457 · Updated this week
- AMD RAD's Triton-based framework for seamless multi-GPU programming ☆164 · Updated this week
- Open ABI and FFI for Machine Learning Systems ☆313 · Updated this week
- ☆159 · Updated last year
- ☆342 · Updated this week
- NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer ☆158 · Updated 4 months ago
- Helpful kernel tutorials and examples for tile-based GPU programming ☆617 · Updated this week
- ☆258 · Updated last year
- GitHub mirror of the triton-lang/triton repo. ☆129 · Updated this week
- ☆173 · Updated 8 months ago
- Perplexity open-source garden for inference technology ☆350 · Updated last month
- ☆76 · Updated last year
- TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels ☆189 · Updated this week
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆457 · Updated 8 months ago
- Thunder Research Group's Collective Communication Library ☆47 · Updated 6 months ago
- Tritonbench: a collection of PyTorch custom operators with example inputs to measure their performance. ☆319 · Updated this week
- Allow torch tensor memory to be released and resumed later ☆213 · Updated 3 weeks ago
- We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel … ☆192 · Updated last year
- ⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs to achieve peak performance⚡️ ☆148 · Updated 8 months ago
- Fastest kernels written from scratch ☆528 · Updated 4 months ago
- ☆87 · Updated 8 months ago
- Tutorials for NVIDIA CUPTI samples ☆50 · Updated 3 months ago
- Accelerating MoE with IO- and Tile-aware Optimizations ☆563 · Updated 2 weeks ago
- Automated Parallelization System and Infrastructure for Multiple Ecosystems ☆82 · Updated last year
- ☆102 · Updated last year
- Tilus: a tile-level kernel programming language with explicit control over shared memory and registers. ☆440 · Updated last month
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving ☆71 · Updated 4 months ago