Bruce-Lee-LY / cuda_hgemv
Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.
☆49 Updated 2 months ago
Related projects
Alternatives and complementary repositories for cuda_hgemv
- Examples of CUDA implementations by Cutlass CuTe ☆98 Updated last week
- FP8 flash attention implemented on the Ada architecture using the cutlass repository ☆52 Updated 3 months ago
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer ☆85 Updated 8 months ago
- High-speed GEMV kernels, with up to 2.7x speedup over the PyTorch baseline. ☆90 Updated 4 months ago
- ☆79 Updated 8 months ago
- Standalone Flash Attention v2 kernel without libtorch dependency