yifuwang/symm-mem-recipes

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/yifuwang/symm-mem-recipes)

yifuwang / symm-mem-recipes

☆170

Alternatives and similar repositories for symm-mem-recipes

Users that are interested in symm-mem-recipes are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

meta-pytorch / kraken
View on GitHub
Triton-based Symmetric Memory operators and examples
☆106May 15, 2026Updated 2 months ago
ByteDance-Seed / Triton-distributed
View on GitHub
Distributed Compiler based on Triton for Parallel Systems
☆1,493Jul 11, 2026Updated last week
cchan / tccl
View on GitHub
extensible collectives library in triton
☆97Mar 31, 2025Updated last year
Dao-AILab / quack
View on GitHub
A Quirky Assortment of CuTe Kernels
☆1,060Updated this week
Dao-AILab / sonic-moe
View on GitHub
Accelerating MoE with IO and Tile-aware Optimizations
☆731Jul 4, 2026Updated 2 weeks ago
Managed Database hosting by DigitalOcean • Ad
PostgreSQL, MySQL, MongoDB, Kafka, Valkey, and OpenSearch available. Automatically scale up storage and focus on building your apps.
KuangjuX / NVSHMEM-Tutorial
View on GitHub
NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer
☆195Feb 11, 2026Updated 5 months ago
bytedance / flux
View on GitHub
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
☆1,343Aug 28, 2025Updated 10 months ago
osayamenja / FlashMoE
View on GitHub
Distributed MoE in a Single Kernel [NeurIPS '25]
☆272May 5, 2026Updated 2 months ago
ROCm / iris
View on GitHub
AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming
☆193Updated this week
infinigence / FlashOverlap
View on GitHub
A lightweight design for computation-communication overlap.
☆242Jan 20, 2026Updated 6 months ago
antgroup / DeepXTrace
View on GitHub
DeepXTrace is a lightweight tool for precisely diagnosing slow ranks in DeepEP-based environments.
☆100Jan 16, 2026Updated 6 months ago
fzyzcjy / torch_memory_saver
View on GitHub
Allow torch tensor memory to be released and resumed later
☆259Updated this week
meta-pytorch / torchcomms
View on GitHub
torchcomms: a modern PyTorch communications API
☆377Updated this week
flashinfer-ai / cubloaty
View on GitHub
a size profiler for cuda binary
☆71Jan 15, 2026Updated 6 months ago
GPU virtual machines on DigitalOcean Gradient AI • Ad
Get to production fast with high-performance AMD and NVIDIA GPUs you can spin up in seconds. The definition of operational simplicity.
perplexityai / pplx-kernels
View on GitHub
Perplexity GPU Kernels
☆590Nov 7, 2025Updated 8 months ago
NVIDIA / nvshmem
View on GitHub
NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process com…
☆560Updated this week
ademeure / DeeperGEMM
View on GitHub
DeeperGEMM: crazy optimized version
☆86May 5, 2025Updated last year
infinigence / HamiltonAttention
View on GitHub
☆45Oct 15, 2025Updated 9 months ago
leimao / CUTLASS-Examples
View on GitHub
CUTLASS and CuTe Examples
☆136Nov 30, 2025Updated 7 months ago
facebookexperimental / triton
View on GitHub
Github mirror of trition-lang/triton repo.
☆178Updated this week
fzyzcjy / torch_utils
View on GitHub
Utility scripts for PyTorch (e.g. Make Perfetto show some disappearing kernels, Memory profiler that understands more low-level allocatio…
☆114Sep 11, 2025Updated 10 months ago
microsoft / tokenweave
View on GitHub
Accepted to MLSys 2026
☆91Apr 19, 2026Updated 3 months ago
inclusionAI / asystem-amem
View on GitHub
A NCCL extension library, designed to efficiently offload GPU memory allocated by the NCCL communication library.
☆110Dec 17, 2025Updated 7 months ago
Managed Kubernetes at scale on DigitalOcean • Ad
DigitalOcean Kubernetes includes the control plane, bandwidth allowance, container registry, automatic updates, and more for free.
HazyResearch / Megakernels
View on GitHub
Kernels, of the mega variety :)
☆778May 26, 2026Updated last month
microsoft / vattention
View on GitHub
Dynamic Memory Management for Serving LLMs without PagedAttention
☆504Updated this week
meta-pytorch / MSLK
View on GitHub
MSLK (Meta Superintelligence Labs Kernels) is a collection of PyTorch GPU operator libraries that are designed and optimized for GenAI tr…
☆121Updated this week
flashinfer-ai / cutlass-viz
View on GitHub
☆65Apr 26, 2025Updated last year
HydraQYH / expert_specialization_moe
View on GitHub
Expert Specialization MoE Solution based on CUTLASS
☆27Apr 14, 2026Updated 3 months ago
leepoly / sm-profiler
View on GitHub
☆82Feb 5, 2026Updated 5 months ago
meta-pytorch / tritonbench
View on GitHub
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆361Updated this week
toyaix / triton-runner
View on GitHub
Multi-Level Triton Runner supporting Python, IR, PTX, AMDGCN, cubin and hasco.
☆98May 8, 2026Updated 2 months ago
chenyu-jiang / dcp
View on GitHub
Code repository for the SOSP'25 paper DCP: Addressing Input Dynamism In Long-Context Training via Dynamic Context Parallelism.
☆21Nov 28, 2025Updated 7 months ago
AI Agents on DigitalOcean Gradient AI Platform • Ad
Build production-ready AI agents using customizable tools or access multiple LLMs through a single endpoint. Create custom knowledge bases or connect external data.
HanGuo97 / hilt
View on GitHub
☆40Dec 14, 2025Updated 7 months ago
pytorch / helion
View on GitHub
A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.
☆909Updated this week
flagos-ai / FlagGems
View on GitHub
FlagGems is an operator library for large language models implemented in the Triton Language.
☆1,052Updated this week
stepfun-ai / StepMesh
View on GitHub
☆377Jan 28, 2026Updated 5 months ago
mirage-project / mirage
View on GitHub
Mirage Persistent Kernel: Compiling LLMs into a MegaKernel
☆2,377Updated this week
CalebDu / Awesome-Cute
View on GitHub
☆121May 16, 2025Updated last year
simveit / effective_transpose
View on GitHub
Effective transpose on Hopper GPU
☆29Sep 6, 2025Updated 10 months ago