deepreinforce-ai / CUDA-L2
CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning
☆318 · Updated last week
Alternatives and similar repositories for CUDA-L2
Users interested in CUDA-L2 are comparing it to the repositories listed below.
- Tilus is a tile-level kernel programming language with explicit control over shared memory and registers. ☆439 · Updated last month
- Fast and Furious AMD Kernels ☆336 · Updated this week
- High-Performance SGEMM on CUDA devices ☆115 · Updated 11 months ago
- Samples of good AI-generated CUDA kernels ☆99 · Updated 7 months ago
- CUDA Tile IR is an MLIR-based intermediate representation and compiler infrastructure for CUDA kernel optimization, focusing on tile-base… ☆773 · Updated this week
- Repository for the QUIK project, enabling the use of 4-bit kernels for generative inference (EMNLP 2024) ☆184 · Updated last year
- TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels ☆185 · Updated this week
- Hand-rolled GPU communications library ☆76 · Updated last month
- An early research-stage expert-parallel load balancer for MoE models based on linear programming. ☆485 · Updated last month
- Custom PTX Instruction Benchmark ☆137 · Updated 10 months ago
- ☆83 · Updated last month
- Ship correct and fast LLM kernels to PyTorch ☆132 · Updated this week
- ☆218 · Updated 11 months ago
- LLM training in simple, raw C/CUDA ☆110 · Updated last year
- mHC kernels implemented in CUDA ☆217 · Updated this week
- Evaluating Large Language Models for CUDA Code Generation. ComputeEval is a framework designed to generate and evaluate CUDA code from Lar… ☆91 · Updated last week
- ☆117 · Updated 7 months ago
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆66 · Updated last week
- Extensible collectives library in Triton ☆92 · Updated 9 months ago
- Kernels, of the mega variety ☆648 · Updated 3 months ago
- An experimental CPU backend for Triton (https://github.com/openai/triton) ☆48 · Updated 4 months ago
- Helpful kernel tutorials and examples for tile-based GPU programming ☆568 · Updated this week
- ☆114 · Updated last week
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆131 · Updated last year
- Efficient Long-context Language Model Training by Core Attention Disaggregation ☆79 · Updated 2 weeks ago
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard! ☆188 · Updated 3 weeks ago
- NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process com… ☆443 · Updated 2 weeks ago
- torchcomms: a modern PyTorch communications API ☆320 · Updated this week
- PyTorch script hot swap: change code without unloading your LLM from VRAM ☆125 · Updated 8 months ago
- PyTorch memory allocation visualizer ☆62 · Updated 6 months ago