shreshthkapai / cuda_latency_benchmark
High-performance CUDA kernels for real-time, low-latency financial inference, optimized for both consumer and datacenter GPUs.
☆19Updated 4 months ago
Alternatives and similar repositories for cuda_latency_benchmark
Users who are interested in cuda_latency_benchmark are comparing it to the libraries listed below.
- Implementation of a methodology that allows all sorts of user-defined GPU kernel fusion, for non-CUDA programmers.☆32Updated last week
- Principles and Methodologies for Serial Performance Optimization (OSDI '25)☆20Updated 6 months ago
- Experimental GPU language with meta-programming☆24Updated last year
- FlexAttention w/ FlashAttention3 Support☆27Updated last year
- Code and data for paper "(How) do Language Models Track State?"☆21Updated 8 months ago
- High-Performance SGEMM on CUDA devices☆113Updated 10 months ago
- TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels☆178Updated last week
- Lightweight Llama 3 8B Inference Engine in CUDA C☆53Updated 8 months ago
- A clean, modular implementation of the Proximal Policy Optimization (PPO) algorithm in PyTorch, written with a strong focus on readabilit…☆19Updated last year
- Quantized LLM training in pure CUDA/C++.☆221Updated this week
- ☆19Updated 2 months ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.☆105Updated 5 months ago
- Samples of good AI generated CUDA kernels☆94Updated 6 months ago
- CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning☆275Updated last month
- ☆15Updated 7 months ago
- Memory Optimizations for Deep Learning (ICML 2023)☆113Updated last year
- Make triton easier☆49Updated last year
- Repository for running LLMs efficiently on Mac silicon (M1, M2, M3). Features Jupyter notebook for Meta-Llama-3 setup using MLX framework…☆11Updated last year
- ☆39Updated last year
- Compression for Foundation Models☆34Updated 4 months ago
- LLM training in simple, raw C/CUDA☆108Updated last year
- Repository for Sparse Finetuning of LLMs via modified version of the MosaicML llmfoundry☆42Updated last year
- Optimizing loading training data from cloud bucket storage for cloud-based distributed deep learning. Official repository for Quantifying…☆11Updated 3 years ago
- ☆27Updated last year
- Quantize transformers to any learned arbitrary 4-bit numeric format☆50Updated 5 months ago
- PyTorch centric eager mode debugger☆48Updated last year
- Loop Nest - Linear algebra compiler and code generator.☆21Updated 3 years ago
- extensible collectives library in triton☆91Updated 8 months ago
- Multi-Turn RL Training System with AgentTrainer for Language Model Game Reinforcement Learning☆54Updated last month
- APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to…☆27Updated this week