shreshthkapai / cuda_latency_benchmark
High-performance CUDA kernels for low-latency, real-time financial inference, optimized for both consumer and datacenter GPUs.
☆19Updated 5 months ago
Alternatives and similar repositories for cuda_latency_benchmark
Users who are interested in cuda_latency_benchmark are comparing it to the repositories listed below.
- Principles and Methodologies for Serial Performance Optimization (OSDI '25)☆21Updated 7 months ago
- Implementation of a methodology that allows all sorts of user-defined GPU kernel fusion, for non-CUDA programmers.☆37Updated this week
- Repository for running LLMs efficiently on Mac silicon (M1, M2, M3). Features Jupyter notebook for Meta-Llama-3 setup using MLX framework…☆11Updated last year
- High-Performance SGEMM on CUDA devices☆115Updated 11 months ago
- Fast and Furious AMD Kernels☆331Updated 2 weeks ago
- Compression for Foundation Models☆35Updated 5 months ago
- ☆19Updated 3 months ago
- Lightweight Llama 3 8B Inference Engine in CUDA C☆53Updated 9 months ago
- Loop Nest - Linear algebra compiler and code generator.☆21Updated 3 years ago
- Memory-Bounded GPU Acceleration for Vector Search☆32Updated 2 weeks ago
- Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA cores for the decoding stage of LLM inference.☆46Updated 7 months ago
- ☆32Updated last year
- ☆38Updated last year
- FlexAttention w/ FlashAttention3 Support☆27Updated last year
- torch.compile artifacts for common deep learning models, can be used as a learning resource for torch.compile☆18Updated 2 years ago
- NVIDIA tools guide☆152Updated last year
- ☆116Updated 7 months ago
- ☆16Updated last year
- A clean, modular implementation of the Proximal Policy Optimization (PPO) algorithm in PyTorch, written with a strong focus on readabilit…☆19Updated last year
- Memory Optimizations for Deep Learning (ICML 2023)☆114Updated last year
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.☆104Updated 6 months ago
- Hand-Rolled GPU communications library☆76Updated last month
- Fast and vectorizable algorithms for searching in a vector of sorted floating point numbers☆153Updated last year
- Quantize transformers to any learned arbitrary 4-bit numeric format☆50Updated 6 months ago
- A dynamic binary instrumentation tool for tracing and analyzing CUDA kernel instructions.☆24Updated this week
- Experimental scripts for researching data adaptive learning rate scheduling.☆22Updated 2 years ago
- Abstractions of memory, allocator, vector, tuple, shared_ptr, unique_ptr, bitset, variant and string working on both CPU and GPU☆31Updated 5 months ago
- The proposal of this work involves a simulation of an ant colony swarm that was applied to a problem of search and rescue of objects of i…☆12Updated 2 years ago
- ☆27Updated 2 years ago
- TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels☆182Updated this week