xiaoyi018 / simple_gemm

☆22

Alternatives and similar repositories for simple_gemm

Users that are interested in simple_gemm are comparing it to the libraries listed below

BBuf / how-to-optimize-gemm
☆98 Updated 4 years ago
tpoisonooo / chgemm
symmetric int8 gemm
☆67 Updated 5 years ago
HuangShiqing / LearnAndTry
☆19 Updated last month
pigirons / conv3x3_m1
A demo of how to write a high-performance convolution that runs on Apple silicon
☆54 Updated 3 years ago
LeiWang1999 / tvm_gpu_gemm
Playing with GEMM in TVM
☆91 Updated 2 years ago
openmlsys / openmlsys-cuda
Tutorials for writing high-performance GPU operators in AI frameworks.
☆132 Updated 2 years ago
JieRen98 / SGEMM-SASS-Annotation
☆21 Updated 4 years ago
Bruce-Lee-LY / flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆40 Updated 7 months ago
MegEngine / MegCC
MegCC is a deep learning model compiler with an ultra-lightweight runtime that is efficient and easy to port
☆486 Updated 11 months ago
ByteDance-Seed / decoupleQ
A quantization algorithm for LLM
☆143 Updated last year
daquexian / faster-rwkv
☆125 Updated last year
Cambricon / mlu-ops
Efficient operation implementation based on the Cambricon Machine Learning Unit (MLU).
☆134 Updated last week
carlushuang / cpu_gemm_opt
How to design a CPU GEMM on x86 with avx256 that can beat OpenBLAS.
☆72 Updated 6 years ago
MARD1NO / CUDA-PPT
☆109 Updated 6 months ago
caiwanxianhust / FasterLLaMA
A llama model inference framework implemented in CUDA C++
☆62 Updated 10 months ago
njuhope / cuda_sgemm
☆115 Updated last year
tigert1998 / qat
Manually implemented quantization-aware training
☆21 Updated 2 years ago
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆186 Updated 8 months ago
JackonYang / hands-on-tvm
Hands-on model tuning with TVM, profiled on a Mac M1, an x86 CPU, and a GTX-1080 GPU.
☆50 Updated 2 years ago
tlc-pack / cutlass_fpA_intB_gemm
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
☆94 Updated 3 weeks ago
OpenPPL / ppl.pmx
☆59 Updated 10 months ago
luchangli03 / onnxsim_large_model
Simplify large (>2GB) ONNX models
☆63 Updated 10 months ago
tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆110 Updated last year
OpenPPL / ppl.nn.llm
☆140 Updated last year
mlc-ai / notebooks
☆210 Updated 10 months ago
NVIDIA / online-softmax
Benchmark code for the "Online normalizer calculation for softmax" paper
☆101 Updated 7 years ago
dianhsu / transformer-cpp-cpu
A simple Transformer model implemented in C++. Attention Is All You Need.
☆50 Updated 4 years ago
toyaix / TritonLLM
LLM Inference via Triton (Flexible & Modular): Focused on Kernel Optimization using CUBIN binaries, Starting from gpt-oss Model
☆46 Updated last month
FlagTree / libtriton_jit
A Triton JIT runtime and ffi provider in C++
☆25 Updated 2 weeks ago
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆119 Updated 4 months ago