mag- / gpu_benchmark

GPU benchmark

☆69

Alternatives and similar repositories for gpu_benchmark

Users who are interested in gpu_benchmark are comparing it to the repositories listed below.

salykova / sgemm.cu
High-Performance SGEMM on CUDA devices
☆107 Updated 8 months ago
huggingface / kernel-builder
👷 Build compute kernels
☆158 Updated this week
HazyResearch / train-tk
train with kittens!
☆63 Updated 11 months ago
OpenMachine-ai / transformer-tricks
A collection of tricks and tools to speed up transformer models
☆182 Updated last week
Cornell-RelaxML / yaqa-quantization
☆60 Updated 3 months ago
Zyphra / tree_attention
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters
☆130 Updated 10 months ago
kyleliang919 / Super_Muon
☆64 Updated 6 months ago
BlinkDL / modded-nanogpt-rwkv
RWKV-7: Surpassing GPT
☆97 Updated 10 months ago
schwartz-lab-NLP / TOVA
Token Omission Via Attention
☆127 Updated last year
ScalingIntelligence / good-kernels
Samples of good AI generated CUDA kernels
☆91 Updated 4 months ago
main-horse / hnet-old
H-Net Dynamic Hierarchical Architecture
☆80 Updated last month
BlinkDL / fast.c
Prepare for DeepSeek R1 inference: Benchmark CPU, DRAM, SSD, iGPU, GPU, ... with efficient code.
☆73 Updated 8 months ago
IST-DASLab / Quartet
☆102 Updated this week
cloneofsimo / ptx-tutorial-by-aislop
PTX-Tutorial Written Purely By AIs (Deep Research of OpenAI and Claude 3.7)
☆66 Updated 6 months ago
gpu-mode / ring-attention
ring-attention experiments
☆153 Updated 11 months ago
chu-tianxiang / QuIP-for-all
QuIP quantization
☆59 Updated last year
huggingface / kernels
Load compute kernels from the Hub
☆299 Updated this week
WaveSpeedAI / QuantumAttention
[WIP] Better (FP8) attention for Hopper
☆33 Updated 7 months ago
vdesai2014 / inference-optimization-blog-post
☆89 Updated last year
IST-DASLab / QuEST
Work in progress.
☆74 Updated 3 months ago
neuralmagic / compressed-tensors
A safetensors extension to efficiently store sparse quantized tensors on disk
☆167 Updated last week
srush / triton-autodiff
Experiment of using Tangent to autodiff triton
☆80 Updated last year
kuterd / opal_ptx
Experimental GPU language with meta-programming
☆23 Updated last year
kroggen / mamba.c
Inference of Mamba models in pure C
☆191 Updated last year
abacusai / gh200-llm
Docker image for NVIDIA GH200 machines, optimized for vllm serving and hf trainer finetuning
☆50 Updated 7 months ago
lianakoleva / no-libtorch-compile
☆21 Updated 7 months ago
amazon-science / mxfp4-llm
Official implementation for Training LLMs with MXFP4
☆96 Updated 5 months ago
graphcore-research / out-of-the-box-fp8-training
Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8.
☆45 Updated last year
UmerHA / triton_util
Make triton easier
☆48 Updated last year
tridao / flash-attention-wheels
☆57 Updated last year