HydraQYH/hp_rms_norm

High-performance RMSNorm implementation using SM core storage (registers and shared memory)

☆30

Alternatives and similar repositories for hp_rms_norm

Users who are interested in hp_rms_norm are comparing it to the libraries listed below.

HydraQYH / expert_specialization_moe
Expert Specialization MoE Solution based on CUTLASS
☆27 · Jan 19, 2026 · Updated last month
zhuzilin / flash-attention-with-sink
☆38 · Aug 7, 2025 · Updated 7 months ago
Yifei-Zuo / Flash-LLA
Official repository for the paper Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regressi…
☆23 · Oct 1, 2025 · Updated 5 months ago
ACA-Lab-SJTU / token-ring
☆13 · Jan 7, 2025 · Updated last year
KuangjuX / AttnLink
An experimental communicating attention kernel based on DeepEP.
☆35 · Jul 29, 2025 · Updated 7 months ago
madsys-dev / deepseekv2-profile
☆155 · Mar 4, 2025 · Updated last year
NTT123 / cute-viz
Cute layout visualization
☆31 · Jan 18, 2026 · Updated last month
sgl-project / DeepGEMM
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
☆21 · Feb 9, 2026 · Updated last month
HectorHHZ / Sparse_Matrix_Tuning
Github repo for ICLR-2025 paper, Fine-tuning Large Language Models with Sparse Matrices
☆24 · Feb 2, 2026 · Updated last month
ByteDance-Seed / StragglerAnalysis
☆51 · Apr 30, 2025 · Updated 10 months ago
IST-DASLab / Quartet-II
Quartet II Official Code
☆53 · Mar 1, 2026 · Updated last week
ppogg / Deepstream-Box
deepstream + cuda, yolo26, yolo-master, yolo11, yolov8, sam, transformer, etc.
☆36 · Feb 7, 2026 · Updated last month
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆150 · May 10, 2025 · Updated 10 months ago
Bruce-Lee-LY / cutlass_gemm
Multiple GEMM operators are constructed with cutlass to support LLM inference.
☆21 · Aug 3, 2025 · Updated 7 months ago
Bruce-Lee-LY / decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference.
☆46 · Jun 11, 2025 · Updated 8 months ago
sgl-project / sgl-cookbook
Cookbook of SGLang - Recipe
☆93 · Updated this week
NonvolatileMemory / flash_tree_attn
☆20 · Dec 24, 2024 · Updated last year
KuangjuX / NVSHMEM-Tutorial
NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer
☆165 · Feb 11, 2026 · Updated 3 weeks ago
li199603 / sgemm_with_cuda
SGEMM optimization with cuda step by step
☆21 · Mar 23, 2024 · Updated last year
caibucai22 / awesome-cuda
Awesome code, projects, books, etc. related to CUDA
☆31 · Feb 3, 2026 · Updated last month
Fridge003 / Cuda-Learn-By-Practice
Codebase for Cuda Learning
☆31 · Jul 13, 2024 · Updated last year
HPMLL / NVIDIA-Hopper-Benchmark
☆89 · May 31, 2025 · Updated 9 months ago
zartbot / shallowsim
DeepSeek-V3/R1 inference performance simulator
☆180 · Mar 27, 2025 · Updated 11 months ago
serdes21 / flashtile
FlashTile is a CUDA Tile IR compiler that is compatible with NVIDIA's tileiras, targeting SM70 through SM121 NVIDIA GPUs.
☆56 · Feb 6, 2026 · Updated last month
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆223 · Jan 20, 2026 · Updated last month
Enigmatisms / cuda-pt
Writing a CUDA software ray tracing renderer with Analysis-Driven Optimization from scratch: a python-importable, distributed parallel re…
☆37 · Oct 5, 2025 · Updated 5 months ago
nex-agi / NexVenusCL
Nex Venus Communication Library
☆72 · Nov 17, 2025 · Updated 3 months ago
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆74 · May 5, 2025 · Updated 10 months ago
PluralisResearch / AsyncPP
Asynchronous pipeline parallel optimization
☆19 · Feb 2, 2026 · Updated last month
WaveSpeedAI / QuantumAttention
[WIP] Better (FP8) attention for Hopper
☆32 · Feb 24, 2025 · Updated last year
tile-ai / TileOPs
☆88 · Updated this week
weishengying / cutlass_flash_atten_fp8
Implements FP8 flash attention on the Ada architecture using the cutlass repository
☆79 · Aug 12, 2024 · Updated last year
tensorchord / deepseek-api-arena
A benchmarking tool for comparing different LLM API providers' DeepSeek model deployments.
☆30 · Mar 28, 2025 · Updated 11 months ago
mdy666 / Qwen-Native-Sparse-Attention
qwen-nsa
☆87 · Oct 14, 2025 · Updated 4 months ago
AyakaGEMM / Hands-on-GEMM
☆147 · Mar 18, 2024 · Updated last year
ArthurinRUC / cutlass-notes
From Minimal GEMM to Everything
☆183 · Feb 10, 2026 · Updated last month
globaledgesoft / Unsupported-Operation-Development-in-SNPE
This project is intended to build and deploy an SNPE model on Qualcomm Devices, which are having unsupported layers which are not part of…
☆10 · Oct 4, 2021 · Updated 4 years ago
ColfaxResearch / cutlass-kernels
☆262 · Jul 11, 2024 · Updated last year
LGAI-Research / SetR
☆20 · Sep 11, 2025 · Updated 5 months ago