shreshthkapai / cuda_latency_benchmark
High-performance CUDA kernels for real-time, low-latency financial inference, optimized for both consumer and datacenter GPUs.
☆19 · Updated 4 months ago
Alternatives and similar repositories for cuda_latency_benchmark
Users interested in cuda_latency_benchmark are comparing it to the libraries listed below.
- Lightweight Llama 3 8B Inference Engine in CUDA C ☆53 · Updated 8 months ago
- Implementation of a methodology that enables arbitrary user-defined GPU kernel fusion, for non-CUDA programmers. ☆30 · Updated last week
- Principles and Methodologies for Serial Performance Optimization (OSDI '25) ☆20 · Updated 5 months ago
- Loop Nest - Linear algebra compiler and code generator. ☆21 · Updated 3 years ago
- Memory Optimizations for Deep Learning (ICML 2023) ☆111 · Updated last year
- High-Performance SGEMM on CUDA devices ☆112 · Updated 10 months ago
- ☆16 · Updated last year
- FlexAttention w/ FlashAttention3 Support ☆27 · Updated last year
- TileFusion is an experimental C++ macro kernel template library that raises the abstraction level of tile processing in CUDA C. ☆102 · Updated 5 months ago
- LLM training in simple, raw C/CUDA ☆108 · Updated last year
- ☆19 · Updated 2 months ago
- Repository for running LLMs efficiently on Apple silicon (M1, M2, M3). Features a Jupyter notebook for Meta-Llama-3 setup using the MLX framework… ☆11 · Updated last year
- Fast and Furious AMD Kernels ☆298 · Updated this week
- Nsight Compute In Docker ☆12 · Updated last year
- No-GIL Python environment featuring NVIDIA Deep Learning libraries. ☆69 · Updated 7 months ago
- A curated list of Efficient Large Language Models ☆11 · Updated last year
- ☆19 · Updated 3 years ago
- Distributed k-nearest neighbors using Locality Sensitive Hashing and SYCL ☆10 · Updated 4 years ago
- NVIDIA tools guide ☆149 · Updated 10 months ago
- An experimental CPU backend for Triton (https://github.com/openai/triton) ☆47 · Updated 3 months ago
- Experimental scripts for researching data-adaptive learning rate scheduling. ☆22 · Updated 2 years ago
- A set of hands-on tutorials for CUDA programming ☆241 · Updated last year
- ☆14 · Updated 3 weeks ago
- Parallel framework for training and fine-tuning deep neural networks ☆69 · Updated 2 weeks ago
- APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to… ☆25 · Updated this week
- C++ Deep Reinforcement Learning Agent library ☆13 · Updated last year
- GeoT: Tensor-Centric Library for Graph Neural Networks via Efficient Segment Reduction on GPU ☆23 · Updated 8 months ago
- Intel Gaudi's Megatron-DeepSpeed for training Large Language Models ☆15 · Updated 11 months ago
- ☆111 · Updated 6 months ago
- High-speed GEMV kernels, up to 2.7x speedup over the PyTorch baseline. ☆122 · Updated last year
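Most of the repositories above advertise latency or speedup numbers. As context for readers comparing them, the usual measurement pattern — warmup runs, repeated timed runs, then percentile statistics — can be sketched in plain Python. This is an illustrative harness, not code from cuda_latency_benchmark or any listed repo, and `gemv` here is a naive stand-in workload:

```python
import time

def measure_latency(fn, warmup=10, iters=100):
    """Return (p50, p99) latency in microseconds for fn()."""
    for _ in range(warmup):      # warm caches / allocators before timing
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e6)
    samples.sort()
    p50 = samples[len(samples) // 2]
    p99 = samples[min(iters - 1, int(iters * 0.99))]
    return p50, p99

def gemv(A, x):
    # Stand-in workload: naive matrix-vector product.
    return [sum(a * b for a, b in zip(row, x)) for row in A]

if __name__ == "__main__":
    n = 64
    A = [[1.0] * n for _ in range(n)]
    x = [1.0] * n
    p50, p99 = measure_latency(lambda: gemv(A, x))
    print(f"p50={p50:.1f}us p99={p99:.1f}us")
```

For GPU kernels, wall-clock timing like this must bracket device synchronization (e.g. CUDA events or `cudaDeviceSynchronize`), since kernel launches are asynchronous; the overall structure of warmup, repetition, and percentiles stays the same.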