yifuwang / symm-mem-recipes
☆158Updated last year
Alternatives and similar repositories for symm-mem-recipes
Users that are interested in symm-mem-recipes are comparing it to the libraries listed below
- ☆258Updated last year
- ☆171Updated 8 months ago
- A lightweight design for computation-communication overlap.☆213Updated last week
- GitHub mirror of the triton-lang/triton repo.☆126Updated last week
- ☆102Updated last year
- NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer☆158Updated 4 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.☆319Updated this week
- AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming☆164Updated last week
- ☆105Updated last year
- Allow torch tensor memory to be released and resumed later☆213Updated 2 weeks ago
- PyTorch bindings for CUTLASS grouped GEMM.☆141Updated 8 months ago
- ☆112Updated 8 months ago
- nnScaler: Compiling DNN models for Parallel Training☆124Updated 4 months ago
- Applied AI experiments and examples for PyTorch☆314Updated 5 months ago
- Extensible collectives library in Triton☆93Updated 10 months ago
- Automated Parallelization System and Infrastructure for Multiple Ecosystems☆82Updated last year
- ☆84Updated 3 years ago
- Autonomous GPU Kernel Generation via Deep Agents☆223Updated this week
- Benchmark code for the "Online normalizer calculation for softmax" paper☆105Updated 7 years ago
- High-speed GEMV kernels, up to 2.7x speedup compared to the PyTorch baseline.☆127Updated last year
- We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …☆192Updated last year
- NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process com…☆459Updated last month
- ☆77Updated 4 years ago
- Pipeline Parallelism Emulation and Visualization☆76Updated 3 weeks ago
- Utility scripts for PyTorch (e.g. Make Perfetto show some disappearing kernels, Memory profiler that understands more low-level allocatio…☆82Updated 4 months ago
- Chimera: bidirectional pipeline parallelism for efficiently training large-scale models.☆70Updated 10 months ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5).☆276Updated 6 months ago
- ☆159Updated 2 months ago
- Perplexity GPU Kernels☆554Updated 2 months ago
- Zero Bubble Pipeline Parallelism☆449Updated 8 months ago