salykova / sgemm.c

Multi-Threaded FP32 Matrix Multiplication on x86 CPUs

☆348

Alternatives and similar repositories for sgemm.c:

Users that are interested in sgemm.c are comparing it to the libraries listed below

unixpickle / learn-ptx
Learning about CUDA by writing PTX code.
☆128Updated last year
gevtushenko / llm.c
LLM training in simple, raw C/CUDA
☆92Updated 11 months ago
ulrichstern / cuda-convnet
Alex Krizhevsky's original code from Google Code
☆191Updated 9 years ago
salykova / sgemm.cu
High-Performance SGEMM on CUDA devices
☆90Updated 3 months ago
a1k0n / a1gpt
throwaway GPT inference
☆138Updated 10 months ago
MarioSieg / magnetron
(WIP) A small but powerful, homemade PyTorch from scratch.
☆543Updated last week
mlecauchois / micrograd-cuda
☆241Updated last year
mesozoic-egg / tinygrad-notes
Tutorials on tinygrad
☆370Updated last month
apoorvnandan / lilgrad
pytorch from scratch in pure C/CUDA and python
☆40Updated 6 months ago
pranjalssh / fast.cu
Fastest kernels written from scratch
☆236Updated 3 weeks ago
siboehm / SGEMM_CUDA
Fast CUDA matrix multiplication from scratch
☆697Updated last year
tgautam03 / xGeMM
Accelerated General (FP32) Matrix Multiplication from scratch in CUDA
☆114Updated 3 months ago
kuterd / nv_isa_solver
Nvidia Instruction Set Specification Generator
☆256Updated 9 months ago
smolorg / smolgrad
small auto-grad engine inspired from Karpathy's micrograd and PyTorch
☆252Updated 5 months ago
Maharshi-Pandya / cudacodes
Learnings and programs related to CUDA
☆379Updated 2 months ago
LaurieWired / BenchmarkCustomPTX
Custom PTX Instruction Benchmark
☆122Updated last month
BobMcDear / attorch
A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.
☆534Updated this week
nreHieW / r-nn
Tensor library with autograd using only Rust's standard library
☆67Updated 9 months ago
clu0 / unet.cu
UNet diffusion model in pure CUDA
☆602Updated 9 months ago
linjames0 / Transformer-CUDA
An implementation of the transformer architecture onto an Nvidia CUDA kernel
☆179Updated last year
lucasdelimanogueira / PyNorch
Recreating PyTorch from scratch (C/C++, CUDA, NCCL and Python, with multi-GPU support and automatic differentiation!)
☆150Updated 10 months ago
rkinas / triton-resources
A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code.
☆339Updated last month
astledsa / Deep-Learning-C
☆47Updated 3 weeks ago
zeux / calm
CUDA/Metal accelerated language model inference
☆541Updated 2 weeks ago
andrewkchan / yalm
Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O
☆283Updated 3 months ago
drkennetz / cuda_examples
Some CUDA example code with READMEs.
☆94Updated last month
gpu-mode / reference-kernels
Reference Kernels for the Leaderboard
☆33Updated last week
wentasah / mmul-anim
Visualization of cache-optimized matrix multiplication
☆120Updated last month
CisMine / Guide-NVIDIA-Tools
NVIDIA tools guide
☆129Updated 3 months ago
siboehm / ShallowSpeed
Small scale distributed training of sequential deep learning models, built on Numpy and MPI.
☆130Updated last year