maxiaosong1124/ncu-cuda-profiling-skill

maxiaosong1124 / ncu-cuda-profiling-skill

Let coding agents use ncu skills to analyze CUDA programs automatically!

☆117

Alternatives and similar repositories for ncu-cuda-profiling-skill

Users that are interested in ncu-cuda-profiling-skill are comparing it to the libraries listed below.

gxinlong / cuda-optimization-skill
A skill for automatically optimizing CUDA code.
☆42 · Mar 26, 2026 · Updated 4 months ago
slowlyC / agent-gpu-skills
☆153 · Updated this week
mit-han-lab / ncu-report-skill
☆159 · May 24, 2026 · Updated 2 months ago
deciding / cutez
CuTeDSL tutorials, tools, autotuner, profiler, etc.
☆41 · Jun 27, 2026 · Updated last month
sablin39 / tilelang-cuda-skills
Skills for writing tilelang and debugging with CUDA toolkits.
☆133 · May 20, 2026 · Updated 2 months ago
ZJLi2013 / awesome-kernel-skills
☆87 · Mar 31, 2026 · Updated 3 months ago
Tencent / hpc-ops
High Performance LLM Inference Operator Library
☆1,071 · Updated this week
KernelFlow-ops / cuda-optimized-skill
A CUDA kernel optimization toolkit for validation, benchmarking, Nsight Compute profiling, bottleneck analysis, and iterative tuning. It …
☆193 · Apr 22, 2026 · Updated 3 months ago
BBuf / KDA-Pilot
☆234 · Updated this week
KuangjuX / cuda-evolve-oss
Autonomous GPU kernel optimization system driven by AI agents.
☆31 · Mar 29, 2026 · Updated 4 months ago
mit-han-lab / KernelWiki
☆314 · Jun 9, 2026 · Updated last month
BBuf / AI-Infra-Auto-Driven-SKILLS
☆699 · Updated this week
EnigmaHuang / Saad_Book_ForTran
Some "Formula Translations" for Yousef Saad's book "Iterative Methods for Sparse Linear Systems (2nd Edition)"
☆13 · Jan 14, 2018 · Updated 8 years ago
NTT123 / cute-viz
Cute layout visualization
☆44 · Jan 18, 2026 · Updated 6 months ago
tensormux / kernel-skills
Open source skill library for AI coding agents to write, optimize, and debug high performance compute kernels across CUDA, Triton, and qu…
☆51 · Jun 21, 2026 · Updated last month
reed-lau / cute-gemm
☆189 · May 11, 2026 · Updated 2 months ago
technillogue / ptx-isa-markdown
PTX ISA 9.1 documentation converted to searchable markdown. Includes Claude Code skill for CUDA development.
☆220 · Dec 24, 2025 · Updated 7 months ago
rchardx / hopper-gemm
☆48 · Nov 1, 2025 · Updated 8 months ago
zeroine / cutlass-cute-sample
☆49 · Apr 15, 2024 · Updated 2 years ago
ForceInjection / cuda-code-skill
Converts the official documentation for NVIDIA PTX ISA 9.1, CUDA 13.1 (Runtime/Driver), Math API 13.x, cuBLAS 13.2, and NCCL into easily searchable Markdown, and provides a companion AI IDE skill library (supports Claude Cod…
☆20 · Updated this week
leepoly / sm-profiler
☆84 · Feb 5, 2026 · Updated 5 months ago
TongmingLAIC / AKO4ALL
Agentic Kernel Optimization for All: automated GPU kernel optimization for any kernel, any hardware, any language
☆329 · May 31, 2026 · Updated last month
temporal-hpc / reduction-tensor-cores
Fast GPU based tensor core reductions
☆12 · Jan 13, 2023 · Updated 3 years ago
JJXiangJiaoJun / cutlass_gemv
GEMV implementation with CUTLASS
☆21 · Aug 21, 2025 · Updated 11 months ago
mit-han-lab / kernel-design-agents
☆779 · Jun 2, 2026 · Updated last month
gfvvz / triton-learning-materials
Triton Compiler related materials.
☆45 · Mar 16, 2026 · Updated 4 months ago
NVIDIA / nsight-python
Nsight Python is a Python kernel profiling interface based on NVIDIA Nsight Tools
☆285 · Updated this week
KuangjuX / ncu-cli
Automated CUDA kernel performance diagnostics from NVIDIA Nsight Compute (NCU) CSV exports.
☆34 · Mar 18, 2026 · Updated 4 months ago
gevtushenko / block_matrix_format_performance
☆12 · Jan 19, 2020 · Updated 6 years ago
kalfazed / multi-thread-programming
This is a repository to practice multi-thread programming in C++
☆31 · Feb 21, 2024 · Updated 2 years ago
vipulSharma18 / NCCL-From-First-Principles
NCCL communication API layer and transport layer created from first principles.
☆16 · Aug 20, 2025 · Updated 11 months ago
ArthurinRUC / cutlass-notes
From Minimal GEMM to Everything
☆230 · Jul 9, 2026 · Updated 2 weeks ago
windog-labs / edge-fm-x
Efficient inference engine for large models on edge devices
☆31 · Updated this week
xlite-dev / ffpa-attn
🤖FFPA: Extends FA-2/3 via Split-D for large headdims, 1.5x~6×↑🎉 vs SDPA, up to 513~535 TFLOPS🎉 on NVIDIA H200.
☆318 · Updated this week
BBuf / how-to-optim-algorithm-in-cuda
How to optimize some algorithms in CUDA.
☆3,152 · Updated this week
66RING / tiny-flash-attention
Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS
☆528 · Jan 20, 2026 · Updated 6 months ago
tile-ai / TileRT
Tile-Based Runtime for Ultra-Low-Latency LLM Inference
☆1,596 · Jul 14, 2026 · Updated 2 weeks ago
alan-hpc / cuda_op_benchmark
An easily extensible framework for understanding and optimizing CUDA operators, intended for learning use only
☆18 · Jun 13, 2024 · Updated 2 years ago
tlc-pack / cutlass_fpA_intB_gemm
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
☆96 · Jun 21, 2026 · Updated last month