shreshthkapai / cuda_latency_benchmark
High-performance CUDA kernels for real-time, low-latency financial inference, optimized for both consumer and datacenter GPUs.
☆17 · Updated 2 months ago
Alternatives and similar repositories for cuda_latency_benchmark
Users interested in cuda_latency_benchmark are comparing it to the libraries listed below.
- FlexAttention w/ FlashAttention3 Support ☆27 · Updated last year
- Memory Optimizations for Deep Learning (ICML 2023) ☆108 · Updated last year
- Implementation of a methodology that allows arbitrary user-defined GPU kernel fusion for non-CUDA programmers. ☆25 · Updated 2 weeks ago
- ☆40 · Updated 3 weeks ago
- Optimizing the loading of training data from cloud bucket storage for cloud-based distributed deep learning. Official repository for Quantifying… ☆11 · Updated 3 years ago
- Exploration into the Firefly algorithm in PyTorch ☆41 · Updated 8 months ago
- Principles and Methodologies for Serial Performance Optimization (OSDI '25) ☆16 · Updated 4 months ago
- Repository for running LLMs efficiently on Apple silicon (M1, M2, M3). Features a Jupyter notebook for Meta-Llama-3 setup using the MLX framework… ☆11 · Updated last year
- Samples of good AI-generated CUDA kernels ☆91 · Updated 4 months ago
- High-Performance SGEMM on CUDA devices (see the tiled-SGEMM sketch after this list) ☆107 · Updated 8 months ago
- Loop Nest - Linear algebra compiler and code generator. ☆21 · Updated 2 years ago
- ☆28 · Updated 8 months ago
- NVIDIA tools guide ☆143 · Updated 9 months ago
- Make triton easier ☆48 · Updated last year
- Experimental scripts for researching data-adaptive learning rate scheduling. ☆22 · Updated last year
- A faster implementation of OpenCV-CUDA that uses OpenCV objects, and more! ☆53 · Updated 3 weeks ago
- ☆38 · Updated last year
- Code and data for the paper "(How) do Language Models Track State?" ☆19 · Updated 6 months ago
- Lightweight Llama 3 8B Inference Engine in CUDA C ☆48 · Updated 6 months ago
- SYCL implementation of Fused MLPs for Intel GPUs ☆47 · Updated 4 months ago
- An experimental CPU backend for Triton (https://github.com/openai/triton) ☆45 · Updated last month
- Pipeline parallelism for the minimalist ☆35 · Updated 2 months ago
- ☆19 · Updated 3 years ago
- ☆100 · Updated 4 months ago
- Experimental GPU language with meta-programming ☆23 · Updated last year
- Fast and vectorizable algorithms for searching in a vector of sorted floating-point numbers (see the branchless lower-bound sketch after this list) ☆151 · Updated 9 months ago
- Memory-Bounded GPU Acceleration for Vector Search ☆27 · Updated 6 months ago
- No-GIL Python environment featuring NVIDIA Deep Learning libraries. ☆64 · Updated 6 months ago
- Self-contained PyTorch implementation of a Sinkhorn-based router, for mixture-of-experts or otherwise ☆39 · Updated last year
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆98 · Updated 3 months ago
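
For readers comparing the SGEMM-oriented entries, the minimal sketch below shows shared-memory tiling, the standard starting point for fast matrix multiply on CUDA devices. It is an illustration of the general technique only, not code from any repository listed above; the kernel name, `TILE` size, and launch configuration are assumptions made for the example.

```cuda
// Minimal shared-memory-tiled SGEMM sketch: C = alpha*A*B + beta*C,
// with row-major A (MxK), B (KxN), C (MxN). TILE is a tunable block size.
#define TILE 32

__global__ void sgemm_tiled(int M, int N, int K, float alpha,
                            const float* A, const float* B,
                            float beta, float* C) {
    // One TILE x TILE output block per thread block; one output element per thread.
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;  // row of C computed by this thread
    int col = blockIdx.x * TILE + threadIdx.x;  // column of C computed by this thread
    float acc = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        // Stage one tile of A and one tile of B in shared memory (zero-pad at the edges).
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
        __syncthreads();

        // Each thread accumulates a partial dot product from the staged tiles.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < M && col < N)
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
}

// Example launch (device pointers dA, dB, dC assumed already allocated and filled):
//   dim3 block(TILE, TILE);
//   dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
//   sgemm_tiled<<<grid, block>>>(M, N, K, 1.0f, dA, dB, 0.0f, dC);
```

High-performance SGEMM libraries layer register tiling, vectorized loads, and double buffering on top of this pattern, but the shared-memory staging above is the common core.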
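The sorted-float-search entry is about making binary search branch-free and cache-friendly. As a hedged illustration of the branch-free part only (not the listed library's API; the function name and qualifiers are assumptions), a lower-bound search can be written so the only data-dependent choice is a conditional select:

```cuda
// Branchless lower_bound over a sorted float array: returns the index of the
// first element >= key, or n if every element is smaller.
__host__ __device__ inline int lower_bound_branchless(const float* data, int n, float key) {
    int lo = 0;    // the answer always lies in [lo, lo + len]
    int len = n;
    while (len > 1) {
        int half = len / 2;
        // The ternary typically lowers to a conditional move / select rather than
        // a branch, so the loop has no hard-to-predict control flow.
        lo = (data[lo + half - 1] < key) ? lo + half : lo;
        len -= half;
    }
    // Resolve the last remaining candidate (also handles the empty-array case).
    return (n > 0 && data[lo] < key) ? lo + 1 : lo;
}
```

Libraries in this space typically combine such a core with a cache-friendly layout (e.g. Eytzinger ordering) and SIMD batching to reach their headline speedups.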