shreshthkapai / cuda_latency_benchmark
High-performance CUDA kernels for real-time, low-latency financial inference, optimized for both consumer and datacenter GPUs.
☆20Updated 6 months ago
Alternatives and similar repositories for cuda_latency_benchmark
Users who are interested in cuda_latency_benchmark are comparing it to the libraries listed below.
- We aim to redefine the portability, performance, programmability, and maintainability of data-parallel libraries by using C++ standard features, i…☆46Updated this week
- LLM training in simple, raw C/CUDA☆112Updated last year
- Principles and Methodologies for Serial Performance Optimization (OSDI '25)☆23Updated 7 months ago
- High-Performance FP32 GEMM on CUDA devices☆117Updated last year
- FlexAttention w/ FlashAttention3 Support☆27Updated last year
- Lightweight Llama 3 8B Inference Engine in CUDA C☆53Updated 10 months ago
- Fast and Furious AMD Kernels☆346Updated last week
- Parallel framework for training and fine-tuning deep neural networks☆70Updated 2 months ago
- A faster implementation of OpenCV-CUDA that uses OpenCV objects, and more!☆54Updated 2 months ago
- Compression for Foundation Models☆34Updated 6 months ago
- ☆87Updated last week
- ☆117Updated 3 weeks ago
- Quantized LLM training in pure CUDA/C++.☆233Updated last week
- Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA, using CUDA cores for the decoding stage of LLM inference.☆46Updated 7 months ago
- An experimental CPU backend for Triton (https://github.com/openai/triton)☆48Updated 5 months ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.☆105Updated 7 months ago
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8.☆46Updated last year
- SYCL implementation of Fused MLPs for Intel GPUs☆51Updated 2 months ago
- Efficient implementation of DeepSeek Ops (Blockwise FP8 GEMM, MoE, and MLA) for AMD Instinct MI300X☆75Updated 2 months ago
- ☆23Updated 6 months ago
- ☆117Updated 8 months ago
- This repository contains code for the MicroAdam paper.☆22Updated last year
- Pipeline parallelism for the minimalist☆38Updated 5 months ago
- Personal solutions to the Triton Puzzles☆20Updated last year
- Optimizing loading training data from cloud bucket storage for cloud-based distributed deep learning. Official repository for Quantifying…☆11Updated 4 years ago
- Memory Optimizations for Deep Learning (ICML 2023)☆114Updated last year
- General Matrix Multiplication using NVIDIA Tensor Cores☆28Updated last year
- High-performance C++ library for Fast Directional Chamfer Matching, optimized for template matching on untextured objects.☆18Updated last year
- ☆27Updated 2 years ago
- ☆38Updated last year