Azure / msccl-executor-nccl
☆36 · Updated 2 months ago
Alternatives and similar repositories for msccl-executor-nccl:
Users who are interested in msccl-executor-nccl are comparing it to the libraries listed below:
- NCCL Profiling Kit ☆127 · Updated 7 months ago
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving ☆17 · Updated this week
- Microsoft Collective Communication Library ☆61 · Updated 2 months ago
- Thunder Research Group's Collective Communication Library ☆32 · Updated 9 months ago
- Automated Parallelization System and Infrastructure for Multiple Ecosystems ☆78 · Updated 2 months ago
- Paella: Low-latency Model Serving with Virtualized GPU Scheduling ☆58 · Updated 9 months ago
- nnScaler: Compiling DNN models for Parallel Training ☆91 · Updated this week
- Compare different hardware platforms via the Roofline Model for LLM inference tasks. ☆93 · Updated 11 months ago
- NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud. ☆115 · Updated last year
- ☆74 · Updated 2 years ago
- Stateful LLM Serving ☆46 · Updated 6 months ago
- Artifact of OSDI '24 paper, “Llumnix: Dynamic Scheduling for Large Language Model Serving” ☆60 · Updated 8 months ago
- ☆81 · Updated 5 months ago
- Synthesizer for optimal collective communication algorithms ☆103 · Updated 10 months ago
- ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance. ☆51 · Updated last week
- High performance Transformer implementation in C++. ☆101 · Updated 3 weeks ago
- PyTorch distributed training acceleration framework ☆39 · Updated this week
- ☆38 · Updated 8 months ago
- ☆42 · Updated this week
- Official repository for the paper DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines ☆17 · Updated last year
- ☆70 · Updated 3 years ago
- ☆83 · Updated 3 months ago
- MSCCL++: A GPU-driven communication stack for scalable AI applications ☆294 · Updated this week
- RDMA and SHARP plugins for nccl library ☆175 · Updated 3 weeks ago
- A distributed KV store for disaggregated LLM inference ☆28 · Updated this week
- ☆42 · Updated last month
- ☆50 · Updated 8 months ago
- GVProf: A Value Profiler for GPU-based Clusters ☆49 · Updated 10 months ago
- A fast communication-overlapping library for tensor parallelism on GPUs. ☆295 · Updated 3 months ago
- TileFusion is a highly efficient kernel template library designed to elevate the level of abstraction in CUDA C for processing tiles. ☆53 · Updated this week