CalvinXKY / mfu_calculation

A simple calculation for LLM MFU.

☆48

Alternatives and similar repositories for mfu_calculation

Users who are interested in mfu_calculation are comparing it to the repositories listed below.

fzyzcjy / torch_utils
Utility scripts for PyTorch (e.g. Memory profiler that understands more low-level allocations such as NCCL)
☆62 Updated last month
microsoft / chunk-attention
☆78 Updated 6 months ago
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆72 Updated 5 months ago
flashinfer-ai / cutlass-viz
☆65 Updated 6 months ago
madsys-dev / deepseekv2-profile
☆148 Updated 7 months ago
fzyzcjy / torch_memory_saver
Allow torch tensor memory to be released and resumed later
☆150 Updated last week
OpenSQZ / MegatronApp
Toolchain built around Megatron-LM for distributed training
☆67 Updated last week
DeepLink-org / DLSlime
DLSlime: Flexible & Efficient Heterogeneous Transfer Toolkit
☆70 Updated last week
tile-ai / AttentionEngine
☆50 Updated 5 months ago
InternLM / turbomind
☆97 Updated 7 months ago
feifeibear / DPSKV3MFU
Estimate MFU for DeepSeekV3
☆26 Updated 9 months ago
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆125 Updated 4 months ago
InternLM / Awesome-LLM-Training-System
☆43 Updated last year
KuangjuX / AttnLink
An experimental communicating attention kernel based on DeepEP.
☆34 Updated 2 months ago
hao-ai-lab / vllm-ltr
[NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank
☆60 Updated 11 months ago
fanshiqing / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆156 Updated 2 weeks ago
zhuzilin / flash-attention-with-sink
☆39 Updated 2 months ago
AlibabaPAI / FLASHNN
☆100 Updated last year
microsoft / tokenweave
Efficient Compute-Communication Overlap for Distributed LLM Inference
☆61 Updated 3 weeks ago
LeiWang1999 / AutoGPTQ.tvm
GPTQ inference TVM kernel
☆39 Updated last year
feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆78 Updated last year
KuangjuX / NVSHMEM-Tutorial
NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer
☆139 Updated last month
Ascend / torchair
☆18 Updated 2 weeks ago
alibaba / easydist
Automated Parallelization System and Infrastructure for Multiple Ecosystems
☆80 Updated 11 months ago
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆181 Updated 2 weeks ago
thu-pacman / FasterMoE
☆87 Updated 3 years ago
kwai / Megatron-Kwai
[USENIX ATC '24] Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Paral…
☆66 Updated last year
sgl-project / sglang-jax
JAX backend for SGL
☆78 Updated this week
PipeFusion / PipeFusion
A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters
☆50 Updated last year
smart-lty / ParallelSpeculativeDecoding
[ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length
☆120 Updated 6 months ago