aikitoria / nanotrace

Low overhead tracing library and trace visualizer for pipelined CUDA kernels

☆130

Alternatives and similar repositories for nanotrace

Users who are interested in nanotrace compare it to the libraries listed below.

NVIDIA / nvshmem
NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process com…
☆466 · Dec 31, 2025 · Updated last month
lcy-seso / DLFrameworkTest
My tests and experiments with some popular DL frameworks.
☆17 · Sep 11, 2025 · Updated 5 months ago
tile-ai / tilescale
Tile-based language built for AI computation across all scales
☆123 · Updated this week
KuangjuX / AttnLink
An experimental communicating attention kernel based on DeepEP.
☆35 · Jul 29, 2025 · Updated 6 months ago
cherichy / tilecute
☆32 · Jul 2, 2025 · Updated 7 months ago
tile-ai / tilelang-puzzles
Learning TileLang with 10 puzzles!
☆132 · Jan 30, 2026 · Updated 2 weeks ago
tlc-pack / cutlass_fpA_intB_gemm
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
☆96 · Sep 13, 2025 · Updated 5 months ago
infinigence / HamiltonAttention
☆41 · Oct 15, 2025 · Updated 4 months ago
sandyresearch / chipmunk
🎬 3.7× faster video generation E2E 🖼️ 1.6× faster image generation E2E ⚡ ColumnSparseAttn 9.3× vs FlashAttn‑3 💨 ColumnSparseGEMM 2.5× …
☆102 · Sep 8, 2025 · Updated 5 months ago
flashinfer-ai / cutlass-viz
☆65 · Apr 26, 2025 · Updated 9 months ago
KuangjuX / NVSHMEM-Tutorial
NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer
☆163 · Updated this week
Chtholly-Boss / swizzle
A practical way of learning Swizzle
☆36 · Feb 3, 2025 · Updated last year
apache / tvm-ffi
Open ABI and FFI for Machine Learning Systems
☆337 · Updated this week
ConvolutedDog / gpgpu-sim-comments
GPGPU-Sim code with Chinese comments: the latest GPGPU-Sim simulator source, annotated in Chinese to help Chinese-speaking users better understand and use the simulator.
☆28 · Dec 18, 2024 · Updated last year
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆74 · May 5, 2025 · Updated 9 months ago
Dao-AILab / quack
A Quirky Assortment of CuTe Kernels
☆798 · Updated this week
yonsei-hpcp / gcom
☆13 · May 8, 2025 · Updated 9 months ago
yu-yake2002 / ysyx-docker
A Docker image for One Student One Chip's debug exam
☆10 · Sep 22, 2023 · Updated 2 years ago
yhinai / TensorGPGPU
RISC-V vector and tensor compute extensions for Vortex GPGPU acceleration for ML workloads. Optimized for transformer models, CNNs, and g…
☆21 · Apr 25, 2025 · Updated 9 months ago
wu-kan / wuk_cupti_wrapper
A simple API for using CUPTI
☆11 · Aug 19, 2025 · Updated 5 months ago
flashinfer-ai / flashinfer-bench-starter-kit
FlashInfer Bench @ MLSys 2026: Building AI agents to write high performance GPU kernels
☆112 · Feb 9, 2026 · Updated last week
antmicro / astsee
☆15 · Dec 17, 2025 · Updated last month
toyaix / triton-ocl
Triton with an OpenCL backend; uses mlir-translate to generate OpenCL source code
☆24 · Aug 27, 2025 · Updated 5 months ago
sgl-project / sgl-flash-attn
Fast and memory-efficient exact attention
☆18 · Jan 23, 2026 · Updated 3 weeks ago
luongthecong123 / fp8-quant-matmul
Row-wise block scaling for fp8 quantization matrix multiplication. Solution to GPU mode AMD challenge.
☆17 · Feb 9, 2026 · Updated last week
Chocopy-LLVM / chocopy-llvm
ChocoPy LLVM Repo
☆79 · Dec 9, 2022 · Updated 3 years ago
pranjalssh / fast.cu
Fastest kernels written from scratch
☆533 · Sep 18, 2025 · Updated 4 months ago
chengzeyi / piflux
(WIP) Parallel inference for black-forest-labs' FLUX model.
☆18 · Nov 18, 2024 · Updated last year
ByteDance-Seed / Triton-distributed
Distributed Compiler based on Triton for Parallel Systems
☆1,350 · Feb 9, 2026 · Updated last week
JackonYang / hands-on-tvm
Hands-on model tuning with TVM, profiled on a Mac M1, an x86 CPU, and a GTX-1080 GPU.
☆49 · Jun 15, 2023 · Updated 2 years ago
0xD0GF00D / DocumentSASS
Unofficial description of the CUDA assembly (SASS) instruction sets.
☆201 · Jul 18, 2025 · Updated 6 months ago
xxyux / SpInfer
SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
☆61 · Mar 25, 2025 · Updated 10 months ago
CentML / lorafusion
LoRAFusion: Efficient LoRA Fine-Tuning for LLMs
☆23 · Sep 23, 2025 · Updated 4 months ago
GetUpEarlier / minit
☆27 · May 27, 2024 · Updated last year
KnowingNothing / MatmulTutorial
An Easy-to-understand TensorOp Matmul Tutorial
☆410 · Updated this week
HPMLL / NVIDIA-Hopper-Benchmark
☆88 · May 31, 2025 · Updated 8 months ago
phirasit / shor_cuda
Shor's algorithm simulation using CUDA
☆19 · Nov 10, 2019 · Updated 6 years ago
afantideng / caffe_comments
Annotated Caffe source code
☆15 · Aug 15, 2017 · Updated 8 years ago
temporal-hpc / reduction-tensor-cores
Fast GPU based tensor core reductions
☆13 · Jan 13, 2023 · Updated 3 years ago
