tugrul512bit / TurtleSortLinks
Multi-heap-sort for many small arrays, quicksort with 3 pivots for one big array, CUDA acceleration, CUDA memory compression.
☆12 · Updated 9 months ago
Alternatives and similar repositories for TurtleSort
Users interested in TurtleSort are comparing it to the libraries listed below.
Sorting:
- LLM training in simple, raw C/CUDA ☆99 · Updated last year
- High-Performance SGEMM on CUDA devices ☆97 · Updated 5 months ago
- High Order Geometric Multigrid for planes in curvilinear coordinates ☆16 · Updated 3 weeks ago
- asynchronous/distributed speculative evaluation for llama3 ☆39 · Updated 11 months ago
- My submission for the GPUMODE/AMD fp8 mm challenge ☆27 · Updated last month
- Samples of good AI generated CUDA kernels ☆84 · Updated last month
- Super fast FP32 matrix multiplication on RDNA3 ☆68 · Updated 3 months ago
- 👷 Build compute kernels ☆77 · Updated this week
- [DEPRECATED] Moved to ROCm/rocm-libraries repo ☆26 · Updated this week
- [DEPRECATED] Moved to ROCm/rocm-libraries repo ☆110 · Updated this week
- A minimalistic C++ Jinja templating engine for LLM chat templates ☆160 · Updated this week
- Open deep learning compiler stack for cpu, gpu and specialized accelerators ☆19 · Updated last week
- This repository contains a collection of Jupyter Notebooks demonstrating various quantum computing concepts using Qiskit, a popular quant… ☆11 · Updated 10 months ago
- Compute Aggregation Layer for oneAPI Level Zero and OpenCL(TM) Applications ☆17 · Updated 2 weeks ago
- Efficient implementations of Merge Sort and Bitonic Sort algorithms using CUDA for GPU parallel processing, resulting in accelerated sort… ☆16 · Updated last year
- ☆13 · Updated last year
- ☆22 · Updated last month
- PCCL (Prime Collective Communications Library) implements fault tolerant collective communications over IP ☆96 · Updated last month
- ☆59 · Updated last year
- Estimating hardware and cloud costs of LLMs and transformer projects ☆18 · Updated 3 weeks ago
- AI Tensor Engine for ROCm ☆232 · Updated this week
- AMD’s C++ library for accelerating tensor primitives ☆43 · Updated this week
- Lightweight Llama 3 8B Inference Engine in CUDA C ☆47 · Updated 3 months ago
- ☆54 · Updated last year
- Bandwidth test for ROCm ☆60 · Updated this week
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs ☆87 · Updated this week
- Custom PTX Instruction Benchmark ☆126 · Updated 4 months ago
- tinygrad port of the RWKV large language model. ☆45 · Updated 4 months ago
- tiny code to access tenstorrent blackhole ☆55 · Updated last month
- A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis. ☆55 · Updated 3 months ago