deepreinforce-ai / CUDA-L2
CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning
☆252 Updated last week
Alternatives and similar repositories for CUDA-L2
Users who are interested in CUDA-L2 are comparing it to the libraries listed below.
- Fast and Furious AMD Kernels ☆324 Updated last week
- Tilus is a tile-level kernel programming language with explicit control over shared memory and registers. ☆433 Updated last week
- CUDA Tile IR is an MLIR-based intermediate representation and compiler infrastructure for CUDA kernel optimization, focusing on tile-base… ☆649 Updated last week
- Samples of good AI generated CUDA kernels ☆95 Updated 6 months ago
- High-Performance SGEMM on CUDA devices ☆114 Updated 11 months ago
- Repository for the QUIK project, enabling the use of 4bit kernels for generative inference - EMNLP 2024 ☆183 Updated last year
- LLM training in simple, raw C/CUDA ☆108 Updated last year
- PyTorch script hot swap: Change code without unloading your LLM from VRAM ☆125 Updated 8 months ago
- ☆461 Updated last month
- TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels ☆178 Updated last week
- Helpful kernel tutorials and examples for tile-based GPU programming ☆501 Updated this week
- ☆113 Updated last month
- Custom PTX Instruction Benchmark ☆137 Updated 10 months ago
- ☆219 Updated 11 months ago
- ☆82 Updated 3 weeks ago
- Lightweight Llama 3 8B Inference Engine in CUDA C ☆53 Updated 9 months ago
- ☆115 Updated 7 months ago
- Ship correct and fast LLM kernels to PyTorch ☆127 Updated last week
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard! ☆177 Updated last week
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆64 Updated last week
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆131 Updated last year
- Evaluating Large Language Models for CUDA Code Generation ComputeEval is a framework designed to generate and evaluate CUDA code from Lar… ☆86 Updated last month
- An early research stage expert-parallel load balancer for MoE models based on linear programming. ☆476 Updated last month
- Learning about CUDA by writing PTX code. ☆150 Updated last year
- Hand-Rolled GPU communications library ☆76 Updated last month
- 🏙 Interactive performance profiling and debugging tool for PyTorch neural networks. ☆64 Updated 11 months ago
- kernels, of the mega variety ☆634 Updated 3 months ago
- AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming ☆140 Updated this week
- Hashed Lookup Table based Matrix Multiplication (halutmatmul) - Stella Nera accelerator ☆215 Updated 2 years ago
- My submission for the GPUMODE/AMD fp8 mm challenge ☆29 Updated 6 months ago