This project optimizes convolution operators on GPU, including GEMM-based (implicit GEMM) convolution.
☆43 · Updated Sep 29, 2025
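For context on the project's central technique: GEMM-based convolution lowers the convolution to a matrix multiply, classically by materializing an "im2col" patch matrix; implicit GEMM computes those patch indices on the fly instead of building the buffer. A minimal sketch of the lowering (not code from this repo — tensor layout NCHW, stride 1, no padding, and all names are assumptions for illustration):

```python
def conv2d_direct(x, w, C, H, W, K, R, S):
    # Reference convolution: x is input [C][H][W], w is filters [K][C][R][S].
    OH, OW = H - R + 1, W - S + 1
    y = [[[0.0] * OW for _ in range(OH)] for _ in range(K)]
    for k in range(K):
        for oh in range(OH):
            for ow in range(OW):
                acc = 0.0
                for c in range(C):
                    for r in range(R):
                        for s in range(S):
                            acc += x[c][oh + r][ow + s] * w[k][c][r][s]
                y[k][oh][ow] = acc
    return y

def conv2d_gemm(x, w, C, H, W, K, R, S):
    # Same convolution lowered to GEMM: A is the [K, C*R*S] filter matrix,
    # B is the [C*R*S, OH*OW] im2col patch matrix, and Y = A @ B.
    # An implicit-GEMM kernel never stores B; it recomputes the
    # x[c][oh + r][ow + s] indexing inside the GEMM inner loop.
    OH, OW = H - R + 1, W - S + 1
    A = [[w[k][c][r][s] for c in range(C) for r in range(R) for s in range(S)]
         for k in range(K)]
    B = [[x[c][oh + r][ow + s] for oh in range(OH) for ow in range(OW)]
         for c in range(C) for r in range(R) for s in range(S)]
    return [[[sum(A[k][i] * B[i][oh * OW + ow] for i in range(C * R * S))
              for ow in range(OW)] for oh in range(OH)] for k in range(K)]
```

Both functions produce identical outputs; the GEMM form is what lets convolution reuse highly tuned matrix-multiply kernels (cuBLAS, CUTLASS, tensor cores).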
Alternatives and similar repositories for conv_op_optimization
Users who are interested in conv_op_optimization compare it to the libraries listed below.
- GPU implementation of Winograd convolution — ☆10 · Updated Oct 23, 2017
- A Winograd minimal filter implementation in CUDA — ☆28 · Updated Aug 25, 2021
- ☆42 · Updated Nov 1, 2025
- FP8 flash attention implemented on the Ada architecture using the cutlass library — ☆79 · Updated Aug 12, 2024
- ☆12 · Updated Aug 31, 2023
- ☆159 · Updated Dec 26, 2024
- ☆30 · Updated Nov 16, 2024
- Several optimization methods for half-precision general matrix multiplication (HGEMM) using tensor cores with the WMMA API and MMA PTX instructions — ☆526 · Updated Sep 8, 2024
- ☆14 · Updated Nov 3, 2025
- Flash attention in raw CUDA C, beating PyTorch — ☆37 · Updated May 14, 2024
- CUDA 8-bit tensor core matrix multiplication based on the m16n16k16 WMMA API — ☆35 · Updated Sep 15, 2023
- ☆11 · Updated Feb 28, 2023
- Fast GPU-based tensor core reductions — ☆13 · Updated Jan 13, 2023
- ☆62 · Updated Feb 15, 2026
- Study notes on ggml, a machine-learning inference framework — ☆18 · Updated Mar 24, 2024
- ☆115 · Updated May 16, 2025
- Performance of the C++ interfaces of flash attention and flash attention v2 in large language model (LLM) inference scenarios — ☆16 · Updated Aug 31, 2023
- A new batched algorithm for sparse matrix-matrix multiplication (SpMM) — ☆16 · Updated May 7, 2019
- ☆14 · Updated Jun 30, 2021
- Mirror of http://gitlab.hpcrl.cse.ohio-state.edu/chong/ppopp19_ae, refactored for readability — ☆15 · Updated Oct 20, 2021
- A simple high-performance CUDA GEMM implementation — ☆426 · Updated Jan 4, 2024
- Deploys the Nanodet detection algorithm on the OpenVINO inference framework, with rewritten pre- and post-processing for very high detection speed on Intel CPU platforms; the model is also quantized (PTQ) to int8 with NNCF and PPQ for even faster inference — ☆16 · Updated Jun 14, 2023
- Implementation and optimization of matrix multiplication on a single CPU (HPC-THU-2023-Autumn) — ☆18 · Updated Feb 27, 2024
- ☆49 · Updated Apr 15, 2024
- A simplified flash-attention implementation using cutlass, intended for teaching — ☆58 · Updated Aug 12, 2024
- Optimizing SGEMM kernel functions on NVIDIA GPUs to close-to-cuBLAS performance — ☆407 · Updated Jan 2, 2025
- A Python script to convert the output of NVIDIA Nsight Systems (in SQLite format) to JSON in the Google Chrome Trace Event Format — ☆55 · Updated Aug 5, 2025
- Implement flash attention using Cute — ☆101 · Updated Dec 17, 2024
- ☆22 · Updated Mar 5, 2024
- Mixed-precision training from scratch with tensors and CUDA — ☆28 · Updated May 14, 2024
- Optimizing softmax in Triton across many cases — ☆23 · Updated Sep 6, 2024
- Gensis is a lightweight deep learning framework written from scratch in Python, with Triton as its backend for high-performance computing… — ☆37 · Updated Jan 15, 2026
- Notes on understanding the tensorRT_Pro open-source project — ☆22 · Updated Feb 23, 2023
- CPU memory compiler and parallel programming — ☆26 · Updated Nov 18, 2024
- Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores — ☆72 · Updated Sep 8, 2024
- A llama model inference framework implemented in CUDA C++ — ☆64 · Updated Nov 8, 2024
- ☆119 · Updated Apr 2, 2025
- Matrix multiplication on GPUs for matrices stored on a CPU; similar to cublasXt, but ported to both NVIDIA and AMD GPUs — ☆32 · Updated Apr 2, 2025
- Fast CUDA matrix multiplication from scratch — ☆1,060 · Updated Sep 2, 2025