tugrul512bit / TurtleSort
Multi-heap-sort for many small arrays, quicksort with 3 pivots for one big array, CUDA acceleration, CUDA memory compression.
☆11 Updated 8 months ago
Alternatives and similar repositories for TurtleSort
Users who are interested in TurtleSort are comparing it to the libraries listed below.
- Lightweight Llama 3 8B Inference Engine in CUDA C ☆47 Updated 3 months ago
- High-Performance SGEMM on CUDA devices ☆95 Updated 5 months ago
- 👷 Build compute kernels ☆68 Updated this week
- Estimating hardware and cloud costs of LLMs and transformer projects ☆17 Updated this week
- My submission for the GPUMODE/AMD fp8 mm challenge ☆25 Updated 3 weeks ago
- ☆10 Updated 5 months ago
- [DEPRECATED] Moved to ROCm/rocm-libraries repo ☆26 Updated last week
- LLM training in simple, raw C/CUDA ☆99 Updated last year
- A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis. ☆55 Updated 3 months ago
- Super fast FP32 matrix multiplication on RDNA3 ☆64 Updated 2 months ago
- TritonParse is a tool designed to help developers analyze and debug Triton kernels by visualizing the compilation process and source code… ☆93 Updated last week
- Schola is a plugin for enabling Reinforcement Learning (RL) in Unreal Engine. It provides tools to help developers create environments, d… ☆44 Updated last month
- AMD’s C++ library for accelerating tensor primitives ☆42 Updated this week
- Make triton easier ☆46 Updated last year
- AI Tensor Engine for ROCm ☆208 Updated this week
- Gpu benchmark ☆63 Updated 4 months ago
- My CUDA solution to the 1BRC ☆10 Updated last year
- PTX-Tutorial Written Purely By AIs (Deep Research of Openai and Claude 3.7) ☆66 Updated 3 months ago
- ☆13 Updated last year
- monorepo for rocm libraries ☆24 Updated this week
- ☆22 Updated this week
- Numbast is a tool to build an automated pipeline that converts CUDA APIs into Numba bindings. ☆47 Updated this week
- tenstorrent kernel from twitch ☆28 Updated last year
- Next generation LAPACK implementation for ROCm platform ☆103 Updated this week
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆46 Updated this week
- No-GIL Python environment featuring NVIDIA Deep Learning libraries. ☆61 Updated 2 months ago
- Efficient implementations of Merge Sort and Bitonic Sort algorithms using CUDA for GPU parallel processing, resulting in accelerated sort… ☆15 Updated last year
- Parallel Computing starter project to build GPU & CPU kernels in CUDA & C++ and call them from Python without a single line of CMake usin… ☆26 Updated 3 months ago
- FlexAttention w/ FlashAttention3 Support ☆26 Updated 8 months ago
- PCCL (Prime Collective Communications Library) implements fault tolerant collective communications over IP ☆95 Updated last month