Victarry / PP-Schedule-Visualization

Pipeline Parallelism Emulation and Visualization

☆54

Alternatives and similar repositories for PP-Schedule-Visualization

Users who are interested in PP-Schedule-Visualization are comparing it to the libraries listed below.

infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆155 Updated last month
FlagOpen / FlagAttention
A collection of memory efficient attention operators implemented in the Triton language.
☆275 Updated last year
stepfun-ai / StepMesh
☆209 Updated this week
microsoft / nnscaler
nnScaler: Compiling DNN models for Parallel Training
☆114 Updated this week
sail-sg / zero-bubble-pipeline-parallelism
Zero Bubble Pipeline Parallelism
☆411 Updated 3 months ago
AlibabaPAI / FLASHNN
☆96 Updated 10 months ago
kwai / Megatron-Kwai
[USENIX ATC '24] Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Paral…
☆61 Updated last year
alibaba / easydist
Automated Parallelization System and Infrastructure for Multiple Ecosystems
☆79 Updated 8 months ago
fanshiqing / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆134 Updated 3 weeks ago
Karbo123 / pytorch_grouped_gemm
High Performance Grouped GEMM in PyTorch
☆30 Updated 3 years ago
ColfaxResearch / cutlass-kernels
☆228 Updated last year
sgl-project / SpecForge
Train speculative decoding models effortlessly and port them smoothly to SGLang serving.
☆281 Updated this week
madsys-dev / deepseekv2-profile
☆145 Updated 5 months ago
usyd-fsalab / fp6_llm
An efficient GPU support for LLM inference with x-bit quantization (e.g. FP6,FP5).
☆260 Updated 3 weeks ago
66RING / tiny-flash-attention
flash attention tutorial written in python, triton, cuda, cutlass
☆398 Updated 2 months ago
CalebDu / Awesome-Cute
☆91 Updated 2 months ago
LLMServe / SwiftTransformer
High performance Transformer implementation in C++.
☆129 Updated 6 months ago
microsoft / vattention
Dynamic Memory Management for Serving LLMs without PagedAttention
☆405 Updated 2 months ago
InternLM / turbomind
☆92 Updated 4 months ago
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆107 Updated 2 months ago
thunlp / Seq1F1B
Sequence-level 1F1B schedule for LLMs.
☆29 Updated last month
yifuwang / symm-mem-recipes
☆102 Updated 7 months ago
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆71 Updated 3 months ago
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆93 Updated 2 months ago
feifeibear / LLMRoofline
Compare different hardware platforms via the Roofline Model for LLM inference tasks.
☆110 Updated last year
fzyzcjy / torch_memory_saver
Allow torch tensor memory to be released and resumed later
☆93 Updated 3 weeks ago
ParCIS / Chimera
Chimera: bidirectional pipeline parallelism for efficiently training large-scale models.
☆66 Updated 4 months ago
OpenPPL / ppl.llm.kernel.cuda
☆149 Updated 6 months ago
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆183 Updated 6 months ago
AlibabaPAI / torchacc
PyTorch distributed training acceleration framework
☆51 Updated 5 months ago