gevtushenko / llm.c

LLM training in simple, raw C/CUDA

☆86

Related projects

Alternatives and complementary repositories for llm.c

pytorch-labs / triton-cpu
An experimental CPU backend for Triton (https://github.com/openai/triton)
☆34 Updated 5 months ago
Deep-Learning-Profiling-Tools / triton-viz
☆140 Updated this week
gpu-mode / ring-attention
ring-attention experiments
☆95 Updated 3 weeks ago
cchan / tccl
extensible collectives library in triton
☆61 Updated last month
mobiusml / gemlite
Simple and fast low-bit matmul kernels in CUDA / Triton
☆133 Updated this week
siboehm / ShallowSpeed
Small scale distributed training of sequential deep learning models, built on Numpy and MPI.
☆95 Updated last year
moritztng / grayskull-attention
Attention in SRAM on Tenstorrent Grayskull
☆29 Updated 3 months ago
gpu-mode / triton-index
Cataloging released Triton kernels.
☆132 Updated 2 months ago
google / jaxonnxruntime
A user-friendly tool chain that enables the seamless execution of ONNX models using JAX as the backend.
☆98 Updated last month
pytorch-labs / applied-ai
Applied AI experiments and examples for PyTorch
☆159 Updated last week
HanGuo97 / flute
Fast Matrix Multiplications for Lookup Table-Quantized LLMs
☆183 Updated last month
apple / ml-recurrent-drafter
☆96 Updated last month
salykova / matmul.c
Fast, Multi-threaded Matrix Multiplication in C
☆178 Updated 3 weeks ago
unixpickle / learn-ptx
Learning about CUDA by writing PTX code.
☆28 Updated 8 months ago
jax-ml / ml_dtypes
A stand-alone implementation of several NumPy dtype extensions used in machine learning.
☆208 Updated this week
UmerHA / triton_util
Make triton easier
☆41 Updated 4 months ago
gau-nernst / quantized-training
Explore training for quantized models
☆10 Updated 2 weeks ago
zinccat / Awesome-Triton-Kernels
Collection of kernels written in Triton language
☆63 Updated last week
triton-lang / kernels
☆43 Updated this week
thevasudevgupta / gpt-triton
Triton implementation of GPT/LLAMA
☆15 Updated 2 months ago
triton-lang / triton-cpu
An experimental CPU backend for Triton
☆55 Updated last week
GindaChen / FlexFlashAttention3
FlexAttention w/ FlashAttention3 Support
☆26 Updated last month
microsoft / DeepSpeed-Kernels
☆55 Updated 5 months ago
pytorch-labs / float8_experimental
This repository contains the experimental PyTorch native float8 training UX
☆211 Updated 3 months ago
lianakoleva / no-libtorch-compile
☆17 Updated 2 weeks ago
gpu-mode / profiling-cuda-in-torch
☆133 Updated 9 months ago
wangsiping97 / FastGEMV
High-speed GEMV kernels, at most 2.7x speedup compared to pytorch baseline.
☆87 Updated 3 months ago
linjames0 / Transformer-CUDA
An implementation of the transformer architecture onto an Nvidia CUDA kernel
☆157 Updated last year
ColfaxResearch / cutlass-kernels
☆162 Updated 3 months ago
mag- / gpu_benchmark
GPU benchmark
☆43 Updated last month