caijixueIT/CUDA_Learning_for_Freshman

caijixueIT / CUDA_Learning_for_Freshman

☆14

Alternatives and similar repositories for CUDA_Learning_for_Freshman

Users that are interested in CUDA_Learning_for_Freshman are comparing it to the libraries listed below.

weishengying / cutlass_flash_atten_fp8
FP8 flash attention implemented on the Ada architecture using the CUTLASS library
☆82 · Aug 12, 2024 · Updated last year
weishengying / cute_gemm
☆23 · Aug 14, 2024 · Updated last year
KuangjuX / cuda-evolve-oss
Autonomous GPU kernel optimization system driven by AI agents.
☆31 · Mar 29, 2026 · Updated 3 months ago
weishengying / tiny-flash-attention
A stripped-down flash-attention implementation using CUTLASS, intended for teaching
☆59 · Aug 12, 2024 · Updated last year
l-sf / Nanodet_openvino_quant_deploy
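Deploys the NanoDet detection algorithm with the OpenVINO inference framework, with rewritten pre-processing and post-processing for very high performance and fast detection on Intel CPUs. The model is also quantized (PTQ) to int8 precision with the NNCF and PPQ tools for even faster inference.
☆16 · Jun 14, 2023 · Updated 3 years ago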
poad42 / cuda-fp8-ampere
IMMA-based **FP8-as-storage** GEMM experiments for Ampere (sm_86 / RTX 3090 Ti).
☆24 · Jan 30, 2026 · Updated 5 months ago
Yangxiaoz / GGML-Tutorial
To better understand the ggml library
☆30 · Jun 13, 2025 · Updated last year
latentCall145 / channels-last-groupnorm
A CUDA kernel for NHWC GroupNorm for PyTorch
☆23 · Nov 15, 2024 · Updated last year
dsl-learn / cutile-learn
NVIDIA cuTile learn
☆169 · Dec 9, 2025 · Updated 7 months ago
sablin39 / tilelang-cuda-skills
Skills for writing tilelang and debugging with CUDA toolkits.
☆133 · May 20, 2026 · Updated 2 months ago
aikitoria / nanotrace
Low overhead tracing library and trace visualizer for pipelined CUDA kernels
☆137 · Jul 17, 2026 · Updated last week
JJXiangJiaoJun / cutlass_gemv
GEMV implementation with CUTLASS
☆21 · Aug 21, 2025 · Updated 11 months ago
enyac-group / Elana
Elana: A Simple Energy & Latency Analyzer for LLMs
☆16 · Apr 3, 2026 · Updated 3 months ago
HydraQYH / hp_rms_norm
High-performance RMSNorm implemented using SM core storage (registers and shared memory)
☆30 · Jan 22, 2026 · Updated 6 months ago
XdpAreKid / ncnn_paddleocrv4
☆21 · Mar 5, 2024 · Updated 2 years ago
Bruce-Lee-LY / decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference.
☆47 · Jun 11, 2025 · Updated last year
eth-cscs / Tiled-MM
Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.
☆33 · Apr 2, 2025 · Updated last year
HydraQYH / expert_specialization_moe
Expert Specialization MoE Solution based on CUTLASS
☆27 · Apr 14, 2026 · Updated 3 months ago
rathaumons / pyppbox
Toolbox for people detecting, tracking, and re-identifying.
☆18 · May 26, 2026 · Updated last month
cvdong / TRT_PRO_LEARN
Notes on understanding the open-source tensorRT_Pro project
☆22 · Feb 23, 2023 · Updated 3 years ago
m0dulo / Kaleidoscope
🐲 LLVM-based Kaleidoscope language compiler ✨
☆12 · Dec 16, 2022 · Updated 3 years ago
timudk / flux_triton
Writing FLUX in Triton
☆42 · Sep 22, 2024 · Updated last year
tlc-pack / cutlass_fpA_intB_gemm
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
☆96 · Jun 21, 2026 · Updated last month
BBuf / tensorrt-llm-moe
☆34 · Feb 3, 2025 · Updated last year
EdVince / llm-cpp
☆34 · Jul 23, 2024 · Updated 2 years ago
ColfaxResearch / cutlass-kernels
☆270 · Jul 11, 2024 · Updated 2 years ago
kalfazed / multi-thread-programming
This is a repository to practice multi-thread programming in C++
☆31 · Feb 21, 2024 · Updated 2 years ago
diego-aquino / mediakit
Download YouTube videos fast, directly from the command line
☆12 · Jun 3, 2022 · Updated 4 years ago
caiwanxianhust / FasterLLaMA
A LLaMA model inference framework implemented in CUDA C++
☆64 · Nov 8, 2024 · Updated last year
yt605155624 / TTSAndroid
TTS Android demo of PaddleSpeech, merged into https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos
☆28 · Nov 30, 2022 · Updated 3 years ago
IST-DASLab / MicroAdam
This repository contains code for the MicroAdam paper.
☆21 · Dec 14, 2024 · Updated last year
raymond1123 / hgemm
☆30 · Nov 16, 2024 · Updated last year
DataXujing / YOLOv6
A hands-on tutorial for training Meituan's YOLOv6 model and deploying it end to end with TensorRT
☆34 · Jun 30, 2022 · Updated 4 years ago
jundaf2 / CUDA-INT8-GEMM
CUDA 8-bit Tensor Core Matrix Multiplication based on m16n16k16 WMMA API
☆37 · Sep 15, 2023 · Updated 2 years ago
KnowingNothing / MatmulTutorial
An Easy-to-understand TensorOp Matmul Tutorial
☆445 · Mar 5, 2026 · Updated 4 months ago
kberkay / Cuda-Matrix-Multiplication
Matrix Multiplication on GPU using Shared Memory considering Coalescing and Bank Conflicts
☆26 · Aug 29, 2022 · Updated 3 years ago
Adlik / model_zoo
☆11 · Dec 26, 2025 · Updated 7 months ago
globaledgesoft / Unsupported-Operation-Development-in-SNPE
This project is intended to build and deploy an SNPE model on Qualcomm devices, which have unsupported layers that are not part of…
☆10 · Oct 4, 2021 · Updated 4 years ago
cherichy / tilecute
☆32 · Jul 2, 2025 · Updated last year