OSU-Nowlab / Flover
A novel temporal fusion framework for propelling autoregressive model inference
☆11 · Updated this week
Related projects
Alternatives and complementary repositories for Flover
- A source-to-source compiler for optimizing CUDA dynamic parallelism by aggregating launches ☆14 · Updated 5 years ago
- A GPU-driven system framework for scalable AI applications ☆109 · Updated last month
- A Python library that transfers PyTorch tensors between CPU and NVMe ☆98 · Updated last week
- An IR for efficiently simulating distributed ML computation. ☆25 · Updated 10 months ago
- A memory profiler for NVIDIA GPUs to explore memory inefficiencies in GPU-accelerated applications. ☆22 · Updated last month
- TransferBench is a utility for benchmarking simultaneous copies between user-specified devices (CPUs/GPUs) ☆36 · Updated this week
- Compare different hardware platforms via the Roofline Model for LLM inference tasks. ☆75 · Updated 8 months ago
- A TensorFlow extension: GPU performance tools for TensorFlow. ☆25 · Updated last year
- Directed Acyclic Graph Execution Engine (DAGEE) is a C++ library that enables programmers to express computation and data movement, as ta… ☆44 · Updated 3 years ago
- 🔮 Execution time predictions for deep neural network training iterations across different GPUs. ☆56 · Updated last year
- ROCm Tracer Callback/Activity Library for performance tracing of AMD GPUs ☆75 · Updated last week
- An extension library of the WMMA API (Tensor Core API) ☆84 · Updated 4 months ago
- Analysis of traces from byteprofile ☆29 · Updated last year
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆43 · Updated this week
- Open source of an IBM-optimized version of the HPCG benchmark. ☆14 · Updated 8 months ago
- LLM-Inference-Bench ☆11 · Updated last week
- CUDA GPU Benchmark ☆17 · Updated 4 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆19 · Updated this week
- Bandwidth test for ROCm ☆49 · Updated this week
- Fast and memory-efficient exact attention ☆30 · Updated 3 weeks ago
- Paella: Low-latency Model Serving with Virtualized GPU Scheduling ☆57 · Updated 6 months ago
- Artifact of the OSDI '24 paper "Llumnix: Dynamic Scheduling for Large Language Model Serving" ☆57 · Updated 5 months ago
- FTPipe and related pipeline model parallelism research. ☆41 · Updated last year
- A Python script to convert the output of NVIDIA Nsight Systems (in SQLite format) to JSON in Google Chrome Trace Event Format. ☆22 · Updated 2 months ago
- GVProf: A Value Profiler for GPU-based Clusters ☆47 · Updated 7 months ago
- Intel® SHMEM: a device-initiated, shared-memory-based communication library ☆21 · Updated 2 weeks ago
- A Top-Down Profiler for GPU Applications ☆13 · Updated 8 months ago