shreshthkapai / cuda_latency_benchmark

High-performance CUDA kernels for real-time, low-latency financial inference, optimized for both consumer and datacenter GPUs.

☆19

Alternatives and similar repositories for cuda_latency_benchmark

Users who are interested in cuda_latency_benchmark are comparing it to the repositories listed below.


abhisheknair10 / llama3.cu
Lightweight Llama 3 8B Inference Engine in CUDA C
☆53Updated 8 months ago
Libraries-Openly-Fused / FusedKernelLibrary
Implementation of a methodology that allows all sorts of user defined GPU kernel fusion, for non CUDA programmers.
☆30Updated last week
sslab-gatech / SysGPT
Principles and Methodologies for Serial Performance Optimization (OSDI '25)
☆20Updated 5 months ago
facebookresearch / loop_nest
Loop Nest - Linear algebra compiler and code generator.
☆21Updated 3 years ago
facebookresearch / MODel_opt
Memory Optimizations for Deep Learning (ICML 2023)
☆111Updated last year
salykova / sgemm.cu
High-Performance SGEMM on CUDA devices
☆112Updated 10 months ago
manishucsd / py-codegen
☆16Updated last year
GindaChen / FlexFlashAttention3
FlexAttention w/ FlashAttention3 Support
☆27Updated last year
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆102Updated 5 months ago
gevtushenko / llm.c
LLM training in simple, raw C/CUDA
☆108Updated last year
SusCom-Lab / ZSMerge
☆19Updated 2 months ago
GusLovesMath / Llama3_MacSilicon
Repository for running LLMs efficiently on Mac silicon (M1, M2, M3). Features Jupyter notebook for Meta-Llama-3 setup using MLX framework…
☆11Updated last year
HazyResearch / HipKittens
Fast and Furious AMD Kernels
☆298Updated this week
leimao / Nsight-Compute-Docker-Image
Nsight Compute In Docker
☆12Updated last year
NVIDIA / free-threaded-python
No-GIL Python environment featuring NVIDIA Deep Learning libraries.
☆69Updated 7 months ago
merrymercy / Awesome-Efficient-LLM
A curated list for Efficient Large Language Models
☆11Updated last year
ahennequ / cuda-tensorcores-register-mapping
☆19Updated 3 years ago
SC-SGS / Distributed_GPU_LSH_using_SYCL
Distributed k-nearest Neighbors using Locality Sensitive Hashing and SYCL
☆10Updated 4 years ago
CisMine / Guide-NVIDIA-Tools
NVIDIA tools guide
☆149Updated 10 months ago
meta-pytorch / triton-cpu
An experimental CPU backend for Triton (https://github.com/openai/triton)
☆47Updated 3 months ago
facebookresearch / adaptive_scheduling
Experimental scripts for researching data adaptive learning rate scheduling.
☆22Updated 2 years ago
puttsk / cuda-tutorial
A set of hands-on tutorials for CUDA programming
☆241Updated last year
caijixueIT / CUDA_Learning_for_Freshman
☆14Updated 3 weeks ago
axonn-ai / axonn
Parallel framework for training and fine-tuning deep neural networks
☆69Updated 2 weeks ago
habanero-lab / APPy
APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to…
☆25Updated this week
benborder / drla
C++ Deep Reinforcement Learning Agent library
☆13Updated last year
fishmingyu / GeoT
GeoT: Tensor Centric Library for Graph Neural Network via Efficient Segment Reduction on GPU
☆23Updated 8 months ago
HabanaAI / Megatron-DeepSpeed
Intel Gaudi's Megatron DeepSpeed Large Language Models for training
☆15Updated 11 months ago
microsoft / AttentionEngine
☆111Updated 6 months ago
wangsiping97 / FastGEMV
High-speed GEMV kernels, at most 2.7x speedup compared to pytorch baseline.
☆122Updated last year