shreshthkapai / cuda_latency_benchmark
High-performance CUDA kernels for real-time, low-latency financial inference, optimized for both consumer and datacenter GPUs.
☆19 Updated 3 months ago
Alternatives and similar repositories for cuda_latency_benchmark
Users who are interested in cuda_latency_benchmark are comparing it to the libraries listed below.
- Principles and Methodologies for Serial Performance Optimization (OSDI '25) ☆20 Updated 5 months ago
- High-performance C++ library for Fast Directional Chamfer Matching, optimized for template matching on untextured objects. ☆13 Updated 11 months ago
- Implementation of a methodology that allows all sorts of user defined GPU kernel fusion, for non CUDA programmers. ☆26 Updated last week
- Memory-Bounded GPU Acceleration for Vector Search ☆29 Updated 3 weeks ago
- A faster implementation of OpenCV-CUDA that uses OpenCV objects, and more! ☆53 Updated last week
- Optimizing loading training data from cloud bucket storage for cloud-based distributed deep learning. Official repository for Quantifying… ☆11 Updated 3 years ago
- FlexAttention w/ FlashAttention3 Support ☆27 Updated last year
- Repository for running LLMs efficiently on Mac silicon (M1, M2, M3). Features Jupyter notebook for Meta-Llama-3 setup using MLX framework… ☆11 Updated last year
- Lightweight Llama 3 8B Inference Engine in CUDA C ☆48 Updated 7 months ago
- Memory Optimizations for Deep Learning (ICML 2023) ☆110 Updated last year
- Loop Nest - Linear algebra compiler and code generator. ☆21 Updated 3 years ago
- DINOv2 inference engine written in C/C++ using ggml and OpenCV. ☆80 Updated 6 months ago
- High-Performance SGEMM on CUDA devices ☆109 Updated 9 months ago
- Compression for Foundation Models ☆35 Updated 3 months ago
- A curated list for Efficient Large Language Models ☆11 Updated last year
- CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning ☆203 Updated this week
- Some CUDA design patterns and a bit of template magic for CUDA ☆156 Updated 2 years ago
- ☆18 Updated last month
- Parallel Computing starter project to build GPU & CPU kernels in CUDA & C++ and call them from Python without a single line of CMake usin… ☆30 Updated 3 weeks ago
- A warp-oriented dynamic hash table for GPUs ☆76 Updated last year
- Abstractions of memory, allocator, vector, tuple, shared_ptr, unique_ptr, bitset, variant and string working on both CPU and GPU ☆31 Updated 2 months ago
- Zero-copy multimodal vector DB with CUDA and CLIP/SigLIP ☆61 Updated 6 months ago
- DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling ☆21 Updated 2 weeks ago
- Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference. ☆45 Updated 4 months ago
- A Minimalistic Auto-Diff Optimization Framework for Teaching and Understanding Pytorch ☆23 Updated 2 weeks ago
- A set of hands-on tutorials for CUDA programming ☆240 Updated last year
- SYCL implementation of Fused MLPs for Intel GPUs ☆48 Updated 5 months ago
- ☆103 Updated 5 months ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆100 Updated 4 months ago
- ☆49 Updated last month