AndreSlavescu / mHC.cu
mHC kernels implemented in CUDA
☆252 · Jan 14, 2026 · Updated last month
Alternatives and similar repositories for mHC.cu
Users that are interested in mHC.cu are comparing it to the libraries listed below
- GEMV implementation with CUTLASS ☆19 · Aug 21, 2025 · Updated 5 months ago
- NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer ☆163 · Updated this week
- Row-wise block scaling for fp8 quantization matrix multiplication. Solution to GPU mode AMD challenge. ☆17 · Updated this week
- High-performance RMSNorm implementation using SM core storage (registers and shared memory) ☆26 · Jan 22, 2026 · Updated 3 weeks ago
- A practical way of learning Swizzle ☆36 · Feb 3, 2025 · Updated last year
- Official repo for BWLer: Barycentric Weight Layer ☆29 · Sep 26, 2025 · Updated 4 months ago
- Benchmark tests supporting the TiledCUDA library. ☆18 · Nov 19, 2024 · Updated last year
- CUDA SGEMM optimization note ☆15 · Oct 31, 2023 · Updated 2 years ago
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs. ☆32 · Apr 2, 2025 · Updated 10 months ago
- Fork of Flame repo for training of some new stuff in development ☆19 · Jan 5, 2026 · Updated last month
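- An easily extensible framework for understanding and optimizing CUDA operators, intended for learning purposes only ☆18 · Jun 13, 2024 · Updated last year
- A CUDA runtime environment based on the CUDA Driver API ☆15 · Jul 30, 2025 · Updated 6 months ago
- FP8 flash attention implemented on the Ada architecture using the CUTLASS repository ☆78 · Aug 12, 2024 · Updated last year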
- My submission for the GPUMODE/AMD fp8 mm challenge ☆29 · Jun 4, 2025 · Updated 8 months ago
- a size profiler for cuda binary ☆72 · Jan 15, 2026 · Updated 3 weeks ago
- Accelerating MoE with IO and Tile-aware Optimizations ☆583 · Feb 6, 2026 · Updated last week
- Framework to reduce autotune overhead to zero for well known deployments. ☆96 · Sep 19, 2025 · Updated 4 months ago
- Sample Codes using NVSHMEM on Multi-GPU ☆30 · Jan 22, 2023 · Updated 3 years ago
- Awesome code, projects, books, etc. related to CUDA ☆30 · Feb 3, 2026 · Updated last week
- CUTLASS and CuTe Examples ☆127 · Nov 30, 2025 · Updated 2 months ago
- Penn CIS 5650 (GPU Programming and Architecture) Final Project ☆44 · Dec 11, 2023 · Updated 2 years ago
- Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core. ☆72 · Sep 8, 2024 · Updated last year
- Thorn in a HaizeStack test for evaluating long-context adversarial robustness. ☆26 · Aug 3, 2024 · Updated last year
- ☆23 · Jul 11, 2025 · Updated 7 months ago
- Texture Block Compression (BCn) written in Rust ☆11 · Apr 12, 2021 · Updated 4 years ago
- Embroid: Unsupervised Prediction Smoothing Can Improve Few-Shot Classification ☆11 · Aug 12, 2023 · Updated 2 years ago
- We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel … ☆192 · Jan 28, 2025 · Updated last year
- ☆33 · Oct 4, 2024 · Updated last year
- ☆261 · Jul 11, 2024 · Updated last year
- Implementation of the LDP module block in PyTorch and Zeta from the paper: "MobileVLM: A Fast, Strong and Open Vision Language Assistant … ☆15 · Mar 11, 2024 · Updated last year
- Cute layout visualization ☆30 · Jan 18, 2026 · Updated 3 weeks ago
- Repository for "Training Language Models To Explain Their Own Computations" ☆21 · Dec 22, 2025 · Updated last month
- ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance. ☆148 · May 10, 2025 · Updated 9 months ago
- Quartet II Official Code ☆43 · Feb 2, 2026 · Updated last week
- Implementation of the Triangle Multiplicative module, used in Alphafold2 as an efficient way to mix rows or columns of a 2d feature map, … ☆39 · Aug 3, 2021 · Updated 4 years ago
- The evaluation framework for training-free sparse attention in LLMs ☆117 · Jan 27, 2026 · Updated 2 weeks ago
- ☆119 · Apr 2, 2025 · Updated 10 months ago
- ☆35 · Mar 7, 2025 · Updated 11 months ago
- DeepXTrace is a lightweight tool for precisely diagnosing slow ranks in DeepEP-based environments. ☆93 · Jan 16, 2026 · Updated 3 weeks ago