OSU-Nowlab / Flover
A novel temporal fusion framework for propelling autoregressive model inference
☆11 Updated this week
Alternatives and similar repositories for Flover
Users who are interested in Flover are comparing it to the libraries listed below.
- Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs ☆12 Updated 3 months ago
- Sample examples of how to call collective operation functions on multi-GPU environments. A simple example of using broadcast, reduce, all… ☆33 Updated last year
- ☆40 Updated this week
- ☆22 Updated 2 months ago
- Compare different hardware platforms via the Roofline Model for LLM inference tasks. ☆108 Updated last year
- Fast GPU based tensor core reductions ☆13 Updated 2 years ago
- A hierarchical collective communications library with portable optimizations ☆35 Updated 7 months ago
- A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation. ☆95 Updated 2 months ago
- ☆26 Updated 5 months ago
- rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform. ☆91 Updated this week
- An extension library of WMMA API (Tensor Core API) ☆99 Updated last year
- Microsoft Collective Communication Library ☆64 Updated 7 months ago
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving ☆48 Updated 2 months ago
- A memory profiler for NVIDIA GPUs to explore memory inefficiencies in GPU-accelerated applications. ☆25 Updated 9 months ago
- ☆37 Updated 7 months ago
- ROCm Tracer Callback/Activity Library for Performance tracing AMD GPUs ☆84 Updated last week
- Multi-GPU communication profiler and visualizer ☆31 Updated last year
- A CUTLASS implementation using SYCL ☆30 Updated last week
- Bandwidth test for ROCm ☆60 Updated this week
- ☆64 Updated last year
- ☆80 Updated this week
- ☆31 Updated 5 months ago
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs. ☆33 Updated 3 months ago
- RCCL Performance Benchmark Tests