AndreSlavescu / mHC.cu
mHC kernels implemented in CUDA

☆252

Alternatives and similar repositories for mHC.cu

Users who are interested in mHC.cu are comparing it to the repositories listed below.

JJXiangJiaoJun / cutlass_gemv
GEMV implementation with CUTLASS
☆19 · Aug 21, 2025 · Updated 5 months ago
KuangjuX / NVSHMEM-Tutorial
NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer
☆163 · Updated this week
luongthecong123 / fp8-quant-matmul
Row-wise block scaling for fp8 quantization matrix multiplication. Solution to GPU mode AMD challenge.
☆17 · Updated this week
HydraQYH / hp_rms_norm
High-performance RMSNorm implemented using SM core storage (registers and shared memory)
☆26 · Jan 22, 2026 · Updated 3 weeks ago
Chtholly-Boss / swizzle
A practical way of learning Swizzle
☆36 · Feb 3, 2025 · Updated last year
HazyResearch / bwler
Official repo for BWLer: Barycentric Weight Layer
☆29 · Sep 26, 2025 · Updated 4 months ago
TiledTensor / TiledBench
Benchmark tests supporting the TiledCUDA library.
☆18 · Nov 19, 2024 · Updated last year
YangLinzhuo / cuda-sgemm-optimization
CUDA SGEMM optimization note
☆15 · Oct 31, 2023 · Updated 2 years ago
eth-cscs / Tiled-MM
Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.
☆32 · Apr 2, 2025 · Updated 10 months ago
zaydzuhri / flame
Fork of Flame repo for training of some new stuff in development
☆19 · Jan 5, 2026 · Updated last month
cassiewilliam / cuda_op_benchmark
An easily extensible framework for understanding and optimizing CUDA operators; intended for learning purposes only
☆18 · Jun 13, 2024 · Updated last year
YdrMaster / cuda-driver
A CUDA runtime environment based on the CUDA Driver API
☆15 · Jul 30, 2025 · Updated 6 months ago
weishengying / cutlass_flash_atten_fp8
FP8 flash attention implemented on the Ada architecture using the CUTLASS repository
☆78 · Aug 12, 2024 · Updated last year
Snektron / gpumode-amd-fp8-mm
My submission for the GPUMODE/AMD fp8 mm challenge
☆29 · Jun 4, 2025 · Updated 8 months ago
flashinfer-ai / cubloaty
A size profiler for CUDA binaries
☆72 · Jan 15, 2026 · Updated 3 weeks ago
Dao-AILab / sonic-moe
Accelerating MoE with IO and Tile-aware Optimizations
☆583 · Feb 6, 2026 · Updated last week
IBM / triton-dejavu
Framework to reduce autotune overhead to zero for well known deployments.
☆96 · Sep 19, 2025 · Updated 4 months ago
muriloboratto / NVSHEMEM
Sample Codes using NVSHMEM on Multi-GPU
☆30 · Jan 22, 2023 · Updated 3 years ago
caibucai22 / awesome-cuda
Awesome code, projects, books, etc. related to CUDA
☆30 · Feb 3, 2026 · Updated last week
leimao / CUTLASS-Examples
CUTLASS and CuTe Examples
☆127 · Nov 30, 2025 · Updated 2 months ago
yinuotxie / Efficient-LLM-Inferencing-on-GPUs
Penn CIS 5650 (GPU Programming and Architecture) Final Project
☆44 · Dec 11, 2023 · Updated 2 years ago
Bruce-Lee-LY / cuda_hgemv
Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core.
☆72 · Sep 8, 2024 · Updated last year
haizelabs / thorn-in-haizestack
Thorn in a HaizeStack test for evaluating long-context adversarial robustness.
☆26 · Aug 3, 2024 · Updated last year
huggingface / hf-rocm-kernels
☆23 · Jul 11, 2025 · Updated 7 months ago
mrDIMAS / tbc
Texture Block Compression (BCn) written in Rust
☆11 · Apr 12, 2021 · Updated 4 years ago
HazyResearch / embroid
Embroid: Unsupervised Prediction Smoothing Can Improve Few-Shot Classification
☆11 · Aug 12, 2023 · Updated 2 years ago
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆192 · Jan 28, 2025 · Updated last year
AndPotap / einsum-search
☆33 · Oct 4, 2024 · Updated last year
ColfaxResearch / cutlass-kernels
☆261 · Jul 11, 2024 · Updated last year
kyegomez / MobileVLM
Implementation of the LDP module block in PyTorch and Zeta from the paper: "MobileVLM: A Fast, Strong and Open Vision Language Assistant …
☆15 · Mar 11, 2024 · Updated last year
NTT123 / cute-viz
Cute layout visualization
☆30 · Jan 18, 2026 · Updated 3 weeks ago
TransluceAI / introspective-interp
Repository for "Training Language Models To Explain Their Own Computations"
☆21 · Dec 22, 2025 · Updated last month
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆148 · May 10, 2025 · Updated 9 months ago
IST-DASLab / Quartet-II
Quartet II Official Code
☆43 · Feb 2, 2026 · Updated last week
lucidrains / triangle-multiplicative-module
Implementation of the Triangle Multiplicative module, used in Alphafold2 as an efficient way to mix rows or columns of a 2d feature map, …
☆39 · Aug 3, 2021 · Updated 4 years ago
PiotrNawrot / sparse-frontier
The evaluation framework for training-free sparse attention in LLMs
☆117 · Jan 27, 2026 · Updated 2 weeks ago
MARD1NO / CUDA-PPT
☆119 · Apr 2, 2025 · Updated 10 months ago
sustcsonglin / fla-tilelang
☆35 · Mar 7, 2025 · Updated 11 months ago
antgroup / DeepXTrace
DeepXTrace is a lightweight tool for precisely diagnosing slow ranks in DeepEP-based environments.
☆93 · Jan 16, 2026 · Updated 3 weeks ago