NVIDIA/online-softmax

Benchmark code for the "Online normalizer calculation for softmax" paper

☆110

Alternatives and similar repositories for online-softmax

Users who are interested in online-softmax are comparing it to the libraries listed below.

tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆113 · Sep 10, 2024 · Updated last year
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆192 · Jan 28, 2025 · Updated last year
lianakoleva / no-libtorch-compile
☆21 · Mar 3, 2025 · Updated last year
latentCall145 / channels-last-groupnorm
A CUDA kernel for NHWC GroupNorm for PyTorch
☆23 · Nov 15, 2024 · Updated last year
yinuotxie / Efficient-LLM-Inferencing-on-GPUs
Penn CIS 5650 (GPU Programming and Architecture) Final Project
☆45 · Dec 11, 2023 · Updated 2 years ago
pzhao-eng / FlashMLA
☆66 · Feb 15, 2026 · Updated 5 months ago
ColfaxResearch / cutlass-kernels
☆270 · Jul 11, 2024 · Updated 2 years ago
zeroine / cutlass-cute-sample
☆49 · Apr 15, 2024 · Updated 2 years ago
ankan-ban / llama2.cu
Inference Llama 2 in one file of pure Cuda
☆17 · Aug 20, 2023 · Updated 2 years ago
HydraQYH / hp_rms_norm
High-performance RMSNorm implementation using SM core storage (registers and shared memory)
☆30 · Jan 22, 2026 · Updated 6 months ago
OpenPPL / ppl.llm.kernel.cuda
☆150 · Jan 9, 2025 · Updated last year
Yinghan-Li / YHs_Sample
Yinghan's Code Sample
☆365 · Jul 25, 2022 · Updated 4 years ago
KuangjuX / NVSHMEM-Tutorial
NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer
☆195 · Feb 11, 2026 · Updated 5 months ago
CalebDu / Awesome-Cute
☆122 · May 16, 2025 · Updated last year
tpoisonooo / how-to-optimize-gemm
row-major matmul optimization
☆744 · May 14, 2026 · Updated 2 months ago
Bruce-Lee-LY / cuda_hgemm
Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct…
☆558 · Sep 8, 2024 · Updated last year
BBuf / tensorrt-llm-moe
☆34 · Feb 3, 2025 · Updated last year
open-thought / qwen3-vla
Experimental adapter for fine-tuning Qwen3-VL as a Vision-Language-Action (VLA) model
☆15 · Dec 21, 2025 · Updated 7 months ago
Cjkkkk / CUDA_gemm
A simple high performance CUDA GEMM implementation.
☆437 · Jan 4, 2024 · Updated 2 years ago
osayamenja / FlashMoE
Distributed MoE in a Single Kernel [NeurIPS '25]
☆278 · May 5, 2026 · Updated 2 months ago
tlc-pack / cutlass_fpA_intB_gemm
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
☆96 · Jun 21, 2026 · Updated last month
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆154 · May 29, 2025 · Updated last year
66RING / tiny-flash-attention
flash attention tutorial written in python, triton, cuda, cutlass
☆528 · Jan 20, 2026 · Updated 6 months ago
ShaYeBuHui01 / flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆15 · Aug 31, 2023 · Updated 2 years ago
njuhope / cuda_sgemm
☆121 · Apr 11, 2024 · Updated 2 years ago
carsonpo / safetensors.cpp
Zero Dependency LibTorch Safetensors Loading and Storing in C++
☆23 · Jul 12, 2024 · Updated 2 years ago
KuangjuX / AttnLink
An experimental communicating attention kernel based on DeepEP.
☆34 · Jul 29, 2025 · Updated last year
OpenPPL / ppl.nn.llm
☆140 · Apr 23, 2024 · Updated 2 years ago
MARD1NO / CUDA-PPT
☆136 · Apr 16, 2026 · Updated 3 months ago
wangzhaode / jinja.cpp
A lightweight, single-header C++11 Jinja2 template engine for LLM chat templates.
☆20 · Mar 4, 2026 · Updated 4 months ago
YukSing12 / anchordetr_tensorrt
trt-hackathon-2022 third-prize solution
☆10 · Mar 6, 2023 · Updated 3 years ago
Bruce-Lee-LY / decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference.
☆48 · Jun 11, 2025 · Updated last year
microsoft / vattention
Dynamic Memory Management for Serving LLMs without PagedAttention
☆506 · Jul 17, 2026 · Updated last week
facebookexperimental / protoquant
Prototype routines for GPU quantization written using PyTorch.
☆21 · Apr 15, 2026 · Updated 3 months ago
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆157 · May 10, 2025 · Updated last year
gcoe-dresden / cuda-gpu-tlb
TLB Benchmarks
☆35 · Sep 11, 2017 · Updated 8 years ago
Farseer-Scaling-Law / Farseer
☆21 · Jun 12, 2025 · Updated last year
ColfaxResearch / layout-categories
This repository contains companion software for the Colfax Research paper "Categorical Foundations for CuTe Layouts".
☆140 · Sep 24, 2025 · Updated 10 months ago
P1ayer-1 / Llama-LibTorch
Llama causal LM fully recreated in LibTorch. Designed to be used in Unreal Engine 5
☆16 · Sep 19, 2024 · Updated last year