CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning
☆436Mar 30, 2026Updated 2 weeks ago
Alternatives and similar repositories for CUDA-L2
Users that are interested in CUDA-L2 are comparing it to the libraries listed below.
- ☆67Apr 26, 2025Updated 11 months ago
- Paging Debug tool for GDB using python☆13Jun 4, 2022Updated 3 years ago
- ☆13Dec 9, 2024Updated last year
- VortexNet: Neural Computing through Fluid Dynamics☆49Jan 19, 2025Updated last year
- ☆57Feb 24, 2026Updated last month
- TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators☆122Jun 14, 2025Updated 10 months ago
- Automated High-Performance GPU Kernel Generation☆95Apr 11, 2026Updated last week
- PoC for "SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning" [NeurIPS '25]☆67Oct 2, 2025Updated 6 months ago
- ☆31Jun 7, 2025Updated 10 months ago
- Row-wise block scaling for fp8 quantization matrix multiplication. Solution to GPU mode AMD challenge.☆19Feb 9, 2026Updated 2 months ago
- KernelBench: Can LLMs Write GPU Kernels? - Benchmark + Toolkit with Torch -> CUDA (+ more DSLs)☆931Mar 24, 2026Updated 3 weeks ago
- High Performance Int8 GEMM Kernels for SM80 and later GPUs.☆22Mar 11, 2025Updated last year
- Accelerating MoE with IO and Tile-aware Optimizations☆630Apr 1, 2026Updated 2 weeks ago
- ☆119May 19, 2025Updated 10 months ago
- Open deep learning compiler stack for cpu, gpu and specialized accelerators☆19Apr 1, 2026Updated 2 weeks ago
- Fast and Furious AMD Kernels☆394Apr 11, 2026Updated last week
- CUDA Tile IR is an MLIR-based intermediate representation and compiler infrastructure for CUDA kernel optimization, focusing on tile-base…☆933Apr 1, 2026Updated 2 weeks ago
- A lightweight design for computation-communication overlap.☆226Jan 20, 2026Updated 2 months ago
- ☆23Apr 7, 2026Updated last week
- An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale☆237Updated this week
- ☆18Updated this week
- Reproduction of the libsmctrl paper, with an added Python-side interface that can be called flexibly from Python to allocate compute resources☆12May 21, 2024Updated last year
- GPTQ inference TVM kernel☆40Apr 25, 2024Updated last year
- Advancing the frontier of efficient AI☆58Apr 6, 2026Updated last week
- Nsight Python is a Python kernel profiling interface based on NVIDIA Nsight Tools☆192Updated this week
- Go based utilities for working with Apache Mesos Frameworks☆12Apr 22, 2016Updated 9 years ago
- A dynamic binary instrumentation tool for tracing and analyzing CUDA kernel instructions.☆59Updated this week
- Tile primitives for speedy kernels☆3,312Apr 8, 2026Updated last week
- Debug print operator for cudagraph debugging☆14Aug 2, 2024Updated last year
- Implementation from scratch in C of the Multi-head latent attention used in the Deepseek-v3 technical paper.☆18Jan 15, 2025Updated last year
- SystemVerilog implementation of the TAGE branch predictor☆14May 26, 2021Updated 4 years ago
- A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.☆835Updated this week
- Perplexity GPU Kernels☆567Nov 7, 2025Updated 5 months ago
- Compiler plugin for performance analysis of HIP applications☆13Apr 7, 2025Updated last year
- SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs☆64Mar 25, 2025Updated last year
- Benchmark tests supporting the TiledCUDA library.☆18Nov 19, 2024Updated last year
- Distributed Compiler based on Triton for Parallel Systems☆1,403Apr 10, 2026Updated last week
- A command-line interface tool for creating, managing, and verifying Content Provenance and Authenticity (C2PA) manifests for machine lear…☆21Updated this week
- [ICML 2025 Spotlight] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference☆294May 1, 2025Updated 11 months ago