deepreinforce-ai / CUDA-L2
CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning
☆142Updated this week
Alternatives and similar repositories for CUDA-L2
Users who are interested in CUDA-L2 are comparing it to the repositories listed below
- PyTorch script hot swap: Change code without unloading your LLM from VRAM☆125Updated 7 months ago
- Tensor library & inference framework for machine learning☆114Updated 2 months ago
- Lightweight Llama 3 8B Inference Engine in CUDA C☆53Updated 8 months ago
- Fast and Furious AMD Kernels☆309Updated last week
- Hierarchical Navigable Small Worlds☆101Updated 3 months ago
- A minimalistic C++ Jinja templating engine for LLM chat templates☆198Updated 2 months ago
- Algebraic enhancements for GEMM & AI accelerators☆282Updated 9 months ago
- ☆199Updated 7 months ago
- Hashed Lookup Table based Matrix Multiplication (halutmatmul) - Stella Nera accelerator☆214Updated last year
- LLM training in simple, raw C/CUDA☆108Updated last year
- High-Performance SGEMM on CUDA devices☆113Updated 10 months ago
- Train neural networks that distill into logic circuits, using JAX☆63Updated 5 months ago
- Inference of Mamba models in pure C☆194Updated last year
- Samples of good AI generated CUDA kernels☆92Updated 6 months ago
- Repository for the QUIK project, enabling the use of 4bit kernels for generative inference - EMNLP 2024☆184Updated last year
- The Quasi Quantum Assembly Programming Language☆36Updated 3 weeks ago
- tiny code to access tenstorrent blackhole☆61Updated 6 months ago
- ☆456Updated last week
- ☆191Updated last year
- Pivotal Token Search☆132Updated this week
- Richard is gaining power☆199Updated 5 months ago
- GPEmu, a GPU emulator for faster and cheaper prototyping and evaluation of deep learning system research☆34Updated last year
- Fast and vectorizable algorithms for searching in a vector of sorted floating point numbers☆153Updated 11 months ago
- Simple high-throughput inference library☆150Updated 6 months ago
- Tilus is a tile-level kernel programming language with explicit control over shared memory and registers.☆408Updated this week
- A GPU Accelerated Binary Vector Store☆47Updated 9 months ago
- asynchronous/distributed speculative evaluation for llama3☆39Updated last year
- C++ raytracer that supports custom models. Supports running the calculations on the CPU using C++11 threads or in the GPU via CUDA.☆74Updated 2 years ago
- throwaway GPT inference☆141Updated last year
- SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?☆220Updated 2 weeks ago