FlagOpen / FlagCXLinks

☆125

Alternatives and similar repositories for FlagCX

Users that are interested in FlagCX are comparing it to the libraries listed below

Sorting:

infinigence / Semi-PD
A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation.
☆114Updated 5 months ago
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆183Updated last month
bytedance / ByteMLPerf
AI Accelerator Benchmark focuses on evaluating AI Accelerators from a practical production perspective, including the ease of use and ver…
☆267Updated 2 months ago
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆124Updated 6 months ago
DeepLink-org / DLSlime
DLSlime: Flexible & Efficient Heterogeneous Transfer Toolkit
☆75Updated last week
BBuf / tensorrt-llm-moe
☆33Updated 9 months ago
OpenPPL / ppl.llm.kernel.cuda
☆152Updated 10 months ago
xlite-dev / ffpa-attn
🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA.
☆226Updated 3 months ago
vllm-project / flash-attention
Fast and memory-efficient exact attention
☆97Updated last week
LLMServe / SwiftTransformer
High performance Transformer implementation in C++.
☆140Updated 9 months ago
sii-research / VCCL
Venus Collective Communication Library, supported by SII and Infrawaves.
☆108Updated last week
KuangjuX / NVSHMEM-Tutorial
NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer
☆142Updated last month
OpenPPL / ppl.llm.serving
☆130Updated 10 months ago
AlibabaPAI / FLASHNN
☆101Updated last year
CalebDu / Awesome-Cute
☆108Updated 5 months ago
feifeibear / LLMRoofline
Compare different hardware platforms via the Roofline Model for LLM inference tasks.
☆119Updated last year
FlagTree / flagtree
FlagTree is a unified compiler for multiple AI chips, which is forked from triton-lang/triton.
☆129Updated this week
madsys-dev / deepseekv2-profile
☆149Updated 8 months ago
PKU-SEC-Lab / HybriMoE
[DAC'25] Official implement of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference"
☆87Updated 5 months ago
InternLM / turbomind
☆97Updated 7 months ago
AlibabaPAI / torchacc
PyTorch distributed training acceleration framework
☆53Updated 3 months ago
stepfun-ai / StepMesh
☆312Updated last week
Infrawaves / DeepEP_ibrc_dual-ports_multiQP
Aims to implement dual-port and multi-qp solutions in deepEP ibrc transport
☆66Updated 6 months ago
luliyucoordinate / cute-flash-attention
Implement Flash Attention using Cute.
☆96Updated 10 months ago
OpenPPL / ppl.pmx
☆60Updated 11 months ago
Bruce-Lee-LY / flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆41Updated 8 months ago
DeepLink-org / deeplink.framework
☆72Updated last year
Ascend / triton-ascend
Triton adapter for Ascend. Mirror of https://gitee.com/ascend/triton-ascend
☆82Updated this week
Cambricon / torch_mlu
☆44Updated 7 months ago
shenh10 / DeepSeek_Simulator
☆90Updated 7 months ago