li199603/sgemm_with_cuda

li199603 / sgemm_with_cuda

Step-by-step SGEMM optimization with CUDA

☆23

Alternatives and similar repositories for sgemm_with_cuda

Users who are interested in sgemm_with_cuda are comparing it to the repositories listed below.

Yongqi-Zhuo / triton-tvm
Triton to TVM transpiler.
☆24 · Oct 14, 2024 · Updated last year
Bruce-Lee-LY / decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference.
☆48 · Jun 11, 2025 · Updated last year
richjjj / cuvid-tensorrt-multi
ffmpeg+cuvid+tensorrt+multicamera
☆12 · Dec 31, 2024 · Updated last year
shouxieai / nerf_from_scratch
Refactored NeRF code that is easier to read and understand.
☆13 · Mar 26, 2023 · Updated 3 years ago
YdrMaster / cuda-driver
A CUDA runtime environment based on the CUDA Driver API.
☆16 · Jul 30, 2025 · Updated 11 months ago
Bruce-Lee-LY / flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆45 · Feb 27, 2025 · Updated last year
coderonion / cuda-beginner-course-cpp-version
Companion code for the bilibili video series "Introduction to CUDA 12.x Parallel Programming (C++ Edition)".
☆35 · Aug 12, 2024 · Updated last year
Zolewit / TNNdemo
A handy TNN classification demo.
☆11 · Mar 24, 2021 · Updated 5 years ago
Qengineering / YoloV8-seg-NPU
YoloV8 segmentation NPU for the RK 3566/68/88
☆18 · Apr 30, 2024 · Updated 2 years ago
weishengying / cutlass_flash_atten_fp8
FP8 flash attention implemented on the Ada architecture using the CUTLASS library.
☆82 · Aug 12, 2024 · Updated last year
hova88 / CUDA-MatMul-Practice
☆19 · Jan 4, 2024 · Updated 2 years ago
KuangjuX / cu-x
🎉My Collections of CUDA Kernels~
☆11 · Jun 25, 2024 · Updated 2 years ago
xlite-dev / qwen-image-fast
⚡️Qwen-Image 4.8x🎉 speedup with Hybrid Acceleration for low VRAM GPUs
☆17 · Oct 24, 2025 · Updated 9 months ago
shouxieai / bevfusion_02hero
☆17 · Nov 14, 2023 · Updated 2 years ago
sunkx109 / My-Torch-Extension
A minimalist and extensible PyTorch extension for implementing custom backend operators in PyTorch.
☆41 · Jan 24, 2026 · Updated 6 months ago
pointpillars-on-openvino / pointpillars-on-openvino
☆12 · Dec 16, 2021 · Updated 4 years ago
LeiWang1999 / AutoGPTQ.tvm
GPTQ inference TVM kernel
☆41 · Apr 25, 2024 · Updated 2 years ago
lucasjinreal / wnnx_models
Various test models in WNNX format. They can be viewed with `pip install wnetron && wnetron`
☆12 · Jun 22, 2022 · Updated 4 years ago
l-sf / LightTrack_openvino
This repository deploys the LightTrack tracking algorithm with the Intel OpenVINO Toolkit and includes inference code in both Python and C++.
☆21 · Nov 2, 2023 · Updated 2 years ago
Chtholly-Boss / swizzle
A practical way of learning Swizzle
☆45 · Feb 3, 2025 · Updated last year
tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆113 · Sep 10, 2024 · Updated last year
NonvolatileMemory / flash_tree_attn
☆20 · Dec 24, 2024 · Updated last year
iimmortall / QuantLib
☆14 · Feb 3, 2022 · Updated 4 years ago
nicolaswilde / cuda-sgemm
☆73 · Jan 6, 2025 · Updated last year
CalebDu / Awesome-Cute
☆122 · May 16, 2025 · Updated last year
EdVince / whisper-trtllm
Whisper in TensorRT-LLM
☆17 · Sep 21, 2023 · Updated 2 years ago
liyucheng09 / llm-compressive
Longitudinal Evaluation of LLMs via Data Compression
☆32 · May 29, 2024 · Updated 2 years ago
yuzhenmao / IceFormer
Implementation for IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024).
☆25 · Jun 9, 2026 · Updated last month
triple-mu / HunyuanDiT-TensorRT-libtorch
HunyuanDiT with TensorRT and libtorch
☆18 · May 22, 2024 · Updated 2 years ago
xxxxyu / FlexNN
[MobiCom 24] Adaptive DNN inference under memory constraints
☆57 · Jan 22, 2025 · Updated last year
liwuhen / CVDeploy-2D
The repository supports TensorRT, QNN platform inference, 2D obstacle detection yolo series (yolov5, yolov8, yolo11, yolox), semantic seg…
☆20 · May 6, 2025 · Updated last year
hopef / llama3_chat
Llama3 Streaming Chat Sample
☆22 · Apr 24, 2024 · Updated 2 years ago
tlc-pack / cutlass_fpA_intB_gemm
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
☆96 · Jun 21, 2026 · Updated last month
Infrasys-AI / aiinfra-docs
☆21 · Nov 6, 2025 · Updated 8 months ago
weishengying / tiny-flash-attention
A simplified flash-attention implementation using CUTLASS, intended for teaching.
☆59 · Aug 12, 2024 · Updated last year
kalfazed / multi-thread-programming
This is a repository to practice multi-thread programming in C++
☆31 · Feb 21, 2024 · Updated 2 years ago
squest / zenx-integrated-learning
Learning problem-solving, logic/set, math, physics, economics through functional programming using Haskell
☆19 · Oct 16, 2015 · Updated 10 years ago
triple-mu / Qwen-Image-TensorRT
Qwen-Image's DiT inference with TensorRT-10
☆21 · Oct 13, 2025 · Updated 9 months ago
klange / toaru_jpeg
C rewrite of a minimal Python JPEG decoder
☆12 · Jan 2, 2019 · Updated 7 years ago