Infrawaves / DeepEP_ibrc_dual-ports_multiQP
Aims to implement dual-port and multi-QP solutions in the DeepEP IBRC transport.
☆73 · May 9, 2025 · Updated 9 months ago
Alternatives and similar repositories for DeepEP_ibrc_dual-ports_multiQP
Users interested in DeepEP_ibrc_dual-ports_multiQP are comparing it to the libraries listed below.
- Handwritten GEMM using Intel AMX (Advanced Matrix Extensions) · ☆17 · Jan 11, 2025 · Updated last year
- DeepXTrace is a lightweight tool for precisely diagnosing slow ranks in DeepEP-based environments. · ☆93 · Jan 16, 2026 · Updated 3 weeks ago
- ☆38 · Aug 7, 2025 · Updated 6 months ago
- FP8 flash attention implemented on the Ada architecture using the cutlass repository · ☆78 · Aug 12, 2024 · Updated last year
- FlagCX is a scalable and adaptive cross-chip communication library. · ☆173 · Updated this week
- Benchmark tests supporting the TiledCUDA library. · ☆18 · Nov 19, 2024 · Updated last year
- DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling · ☆22 · Updated this week
- Perplexity GPU Kernels · ☆560 · Nov 7, 2025 · Updated 3 months ago
- TiledLower is a dataflow analysis and codegen framework written in Rust. · ☆14 · Nov 23, 2024 · Updated last year
- ☆34 · Feb 3, 2025 · Updated last year
- ☆114 · May 16, 2025 · Updated 8 months ago
- Multiple GEMM operators built with cutlass to support LLM inference. · ☆20 · Aug 3, 2025 · Updated 6 months ago
- [WIP] Better (FP8) attention for Hopper · ☆32 · Feb 24, 2025 · Updated 11 months ago
- ⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak performance.⚡️ · ☆148 · May 10, 2025 · Updated 9 months ago
- ☆61 · Jul 17, 2025 · Updated 6 months ago
- Step-by-step SGEMM optimization with CUDA · ☆21 · Mar 23, 2024 · Updated last year
- TileFusion is an experimental C++ macro kernel template library that raises the abstraction level of CUDA C for tile processing. · ☆106 · Jun 28, 2025 · Updated 7 months ago
- Accelerate LLM preference tuning via prefix sharing with a single line of code · ☆51 · Jul 4, 2025 · Updated 7 months ago
- ☆104 · Nov 7, 2024 · Updated last year
- DeeperGEMM: crazy optimized version · ☆73 · May 5, 2025 · Updated 9 months ago
- A lightweight design for computation-communication overlap. · ☆219 · Jan 20, 2026 · Updated 3 weeks ago
- ☆47 · Dec 13, 2024 · Updated last year
- Quantized Attention on GPU · ☆44 · Nov 22, 2024 · Updated last year
- 🎉My collection of CUDA kernels · ☆11 · Jun 25, 2024 · Updated last year
- ☆162 · Feb 5, 2026 · Updated last week
- Venus Collective Communication Library, supported by SII and Infrawaves. · ☆138 · Updated this week
- An experimental communicating attention kernel based on DeepEP. · ☆35 · Jul 29, 2025 · Updated 6 months ago
- ☆41 · Nov 1, 2025 · Updated 3 months ago
- Kernel Library Wheel for SGLang · ☆17 · Updated this week
- ☕️ A VS Code extension for Netron; supports *.pdmodel, *.nb, *.onnx, *.pb, *.h5, *.tflite, *.pth, *.pt, *.mnn, *.param, etc. · ☆14 · Jun 4, 2023 · Updated 2 years ago
- High-performance RMSNorm implementation using SM core storage (registers and shared memory) · ☆26 · Jan 22, 2026 · Updated 3 weeks ago
- ☆52 · May 19, 2025 · Updated 8 months ago
- Performance of the C++ interfaces of flash attention and flash attention v2 in large language model (LLM) inference scenarios. · ☆44 · Feb 27, 2025 · Updated 11 months ago
- NCCL Profiling Kit · ☆152 · Jul 1, 2024 · Updated last year
- Matrix multiplication on GPUs for matrices stored on the CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs. · ☆32 · Apr 2, 2025 · Updated 10 months ago
- Triton implementation of bi-directional (non-causal) linear attention · ☆65 · Feb 2, 2026 · Updated last week
- DeepStream + CUDA: yolo26, yolo-master, yolo11, yolov8, sam, transformer, etc. · ☆35 · Feb 7, 2026 · Updated last week
- Scripts for managing Debian and RPM package repositories · ☆15 · Jan 14, 2026 · Updated last month
- 🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA. · ☆251 · Updated this week