alibaba / easydist

Automated Parallelization System and Infrastructure for Multiple Ecosystems

☆79

Alternatives and similar repositories for easydist

Users who are interested in easydist are comparing it to the libraries listed below.

infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆145 Updated 3 weeks ago
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆69 Updated 2 months ago
microsoft / nnscaler
nnScaler: Compiling DNN models for Parallel Training
☆113 Updated this week
AlibabaPAI / FLASHNN
☆96 Updated 10 months ago
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆183 Updated 5 months ago
yifuwang / symm-mem-recipes
☆94 Updated 6 months ago
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆90 Updated 2 weeks ago
flashinfer-ai / cutlass-viz
☆60 Updated 2 months ago
microsoft / SparTA
☆148 Updated 11 months ago
triton-lang / kernels
☆83 Updated 8 months ago
LLMServe / SwiftTransformer
High performance Transformer implementation in C++.
☆125 Updated 5 months ago
pytorch-labs / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆176 Updated last week
InternLM / turbomind
☆87 Updated 3 months ago
ColfaxResearch / cutlass-kernels
☆214 Updated last year
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆100 Updated last month
FlagOpen / FlagAttention
A collection of memory efficient attention operators implemented in the Triton language.
☆273 Updated last year
ParCIS / Chimera
Chimera: bidirectional pipeline parallelism for efficiently training large-scale models.
☆67 Updated 3 months ago
xlite-dev / HGEMM
⚡️ Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, and achieve peak performance. ⚡️
☆86 Updated 2 months ago
parasailteam / coconet
☆79 Updated 2 years ago
thu-pacman / PET
PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections
☆121 Updated 3 years ago
zhuohan123 / terapipe
☆75 Updated 4 years ago
DachengLi1 / AMP
(NeurIPS 2022) Automatically finding good model-parallel strategies, especially for complex models and clusters.
☆40 Updated 2 years ago
wangsiping97 / FastGEMV
High-speed GEMV kernels, with up to 2.7x speedup compared to the PyTorch baseline.
☆112 Updated 11 months ago
CalebDu / Awesome-Cute
☆82 Updated last month
AlibabaResearch / flash-llm
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
☆216 Updated last year
zartbot / shallowsim
DeepSeek-V3/R1 inference performance simulator
☆154 Updated 3 months ago
feifeibear / LLMRoofline
Compare different hardware platforms via the Roofline Model for LLM inference tasks.
☆107 Updated last year
hao-ai-lab / MuxServe
☆62 Updated last year
Azure / msccl
Microsoft Collective Communication Library
☆64 Updated 7 months ago
vllm-project / flash-attention
Fast and memory-efficient exact attention
☆79 Updated last week