Decoupled from hardware; intended for learning some NCCL mechanisms
☆25Apr 19, 2024Updated last year
Alternatives and similar repositories for NCCL_GP
Users that are interested in NCCL_GP are comparing it to the libraries listed below
- Official repository of Alibaba erdma drivers☆36Jul 23, 2025Updated 7 months ago
- A way to use CUDA to accelerate the top-k algorithm☆30Jul 11, 2017Updated 8 years ago
- A parser for PTX 6.5☆13Jun 19, 2023Updated 2 years ago
- ☆11Sep 4, 2022Updated 3 years ago
- ☆49Apr 15, 2024Updated last year
- Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference.☆46Jun 11, 2025Updated 8 months ago
- A lightweight design for computation-communication overlap.☆223Jan 20, 2026Updated last month
- ☆11Nov 14, 2023Updated 2 years ago
- ☆11Jun 9, 2023Updated 2 years ago
- Huawei collective communication performance tests☆15May 27, 2024Updated last year
- a fast and customizable CUDA int4 tensor core gemm☆15Aug 2, 2024Updated last year
- ☆50Mar 5, 2024Updated 2 years ago
- Code Repository for the NeurIPS 2024 Paper "Toward Efficient Inference for Mixture of Experts".☆19Oct 30, 2024Updated last year
- Integrates highlight.js to provide code block syntax highlighting for articles☆11Sep 26, 2025Updated 5 months ago
- Local inference for the Llama and Tongyi Qianwen (Qwen) large models, implemented with Apple Metal☆10Apr 26, 2024Updated last year
- ☆10Feb 3, 2022Updated 4 years ago
- RET code☆11Mar 28, 2023Updated 2 years ago
- ☆13Apr 27, 2022Updated 3 years ago
- DeepXTrace is a lightweight tool for precisely diagnosing slow ranks in DeepEP-based environments.☆93Jan 16, 2026Updated last month
- MACCIA 2022 paper reading notes: tasks and datasets☆12Feb 6, 2023Updated 3 years ago
- repository for the MICCAI 2022 AutoPET challenge☆14Sep 19, 2022Updated 3 years ago
- Development area for another repo: Learn_Bluespec_and_RISCV_Design☆13Nov 10, 2025Updated 3 months ago
- A slurm solution for Crusoe Cloud☆13Updated this week
- High-performance RMSNorm implementation using SM core storage (registers and shared memory)☆30Jan 22, 2026Updated last month
- ☆11Jan 23, 2024Updated 2 years ago
- Open Network Linux - An Operating System for Bare Metal Switches☆16Oct 27, 2021Updated 4 years ago
- ☆14May 28, 2019Updated 6 years ago
- Summary of the Specs of Commonly Used GPUs for Training and Inference of LLM☆75Aug 12, 2025Updated 6 months ago
- Scripts which perform an installable binary image build for SONiC☆15Feb 14, 2026Updated 3 weeks ago
- Provides a tool to test CXL with Kernel and Qemu setup☆21Feb 20, 2026Updated 2 weeks ago
- CXL Management Interface library☆24Jan 27, 2026Updated last month
- This is a repository of Binary General Matrix Multiply (BGEMM) implemented with a customized CUDA kernel. Thanks to FP6-LLM for the wheels!☆18Aug 30, 2024Updated last year
- Open Source Projects from Pallas Lab☆21Oct 10, 2021Updated 4 years ago
- A plugin for the Halo search widget☆13Nov 20, 2025Updated 3 months ago
- Simple python library for generating your own perfetto traces for your application. Can be used for both app instrumentation and custom …☆25Jun 22, 2025Updated 8 months ago
- collection of benchmarks to measure basic GPU capabilities☆497Oct 24, 2025Updated 4 months ago
- ☆24Apr 13, 2025Updated 10 months ago
- Several common methods of matrix multiplication are implemented on CPU and Nvidia GPU using C++11 and CUDA.☆15Feb 8, 2023Updated 3 years ago
- Venus Collective Communication Library, supported by SII and Infrawaves.☆138Updated this week