tugrul512bit / TurtleSort
Multi-heap-sort for many small arrays, quicksort with 3 pivots for one big array, CUDA acceleration, CUDA memory compression.
☆13Updated last year
Alternatives and similar repositories for TurtleSort
Users who are interested in TurtleSort are comparing it to the libraries listed below
- High-Performance FP32 GEMM on CUDA devices☆117Updated last year
- LLM training in simple, raw C/CUDA☆112Updated last year
- Fast and Furious AMD Kernels☆348Updated 2 weeks ago
- We aim to redefine Data Parallel libraries portability, performance, programmability and maintainability, by using C++ standard features, i…☆47Updated this week
- ☆53Updated 9 months ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.☆106Updated 7 months ago
- Super fast FP32 matrix multiplication on RDNA3☆82Updated 10 months ago
- Custom PTX Instruction Benchmark☆138Updated 11 months ago
- ☆15Updated 3 months ago
- Cute layout visualization☆29Updated 3 weeks ago
- My submission for the GPUMODE/AMD fp8 mm challenge☆29Updated 8 months ago
- An experimental CPU backend for Triton (https://github.com/openai/triton)☆49Updated 5 months ago
- AI Tensor Engine for ROCm☆351Updated this week
- A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis.☆57Updated 10 months ago
- extensible collectives library in triton☆95Updated 10 months ago
- ☆104Updated last year
- Efficient implementation of DeepSeek Ops (Blockwise FP8 GEMM, MoE, and MLA) for AMD Instinct MI300X☆75Updated last week
- Ahead of Time (AOT) Triton Math Library☆88Updated last week
- CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning☆417Updated last month
- [DEPRECATED] Moved to ROCm/rocm-libraries repo☆113Updated this week
- An extension library of WMMA API (Tensor Core API)☆109Updated last year
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs☆93Updated this week
- Test suite for probing the numerical behavior of NVIDIA tensor cores☆41Updated last year
- Hand-Rolled GPU communications library☆82Updated 2 months ago
- python package of rocm-smi-lib☆24Updated last month
- Row-wise block scaling for fp8 quantization matrix multiplication. Solution to GPU mode AMD challenge.☆17Updated 4 months ago
- ☆32Updated 7 months ago
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.☆32Updated 10 months ago
- MSLK (Meta Superintelligence Labs Kernels) is a collection of PyTorch GPU operator libraries that are designed and optimized for GenAI tr…☆45Updated this week
- Standalone Flash Attention v2 kernel without libtorch dependency☆114Updated last year