tugrul512bit / TurtleSort

Multi-heap-sort for many small arrays, quicksort with 3 pivots for one big array, CUDA acceleration, CUDA memory compression.

☆12

Alternatives and similar repositories for TurtleSort

Users who are interested in TurtleSort are comparing it to the libraries listed below.

gevtushenko / llm.c
LLM training in simple, raw C/CUDA
☆99 Updated last year
salykova / sgemm.cu
High-Performance SGEMM on CUDA devices
☆97 Updated 5 months ago
SciCompMod / GMGPolar
High Order Geometric Multigrid for planes in curvilinear coordinates
☆16 Updated 3 weeks ago
okuvshynov / llama_duo
asynchronous/distributed speculative evaluation for llama3
☆39 Updated 11 months ago
Snektron / gpumode-amd-fp8-mm
My submission for the GPUMODE/AMD fp8 mm challenge
☆27 Updated last month
ScalingIntelligence / good-kernels
Samples of good AI generated CUDA kernels
☆84 Updated last month
seb-v / fp32_sgemm_amd
Super fast FP32 matrix multiplication on RDNA3
☆68 Updated 3 months ago
huggingface / kernel-builder
👷 Build compute kernels
☆77 Updated this week
ROCm / hipRAND
[DEPRECATED] Moved to ROCm/rocm-libraries repo
☆26 Updated this week
ROCm / hipBLASLt
[DEPRECATED] Moved to ROCm/rocm-libraries repo
☆110 Updated this week
google / minja
A minimalistic C++ Jinja templating engine for LLM chat templates
☆160 Updated this week
tile-ai / tvm
Open deep learning compiler stack for cpu, gpu and specialized accelerators
☆19 Updated last week
minnukota381 / Quantum-Computing-Qiskit
This repository contains a collection of Jupyter Notebooks demonstrating various quantum computing concepts using Qiskit, a popular quant…
☆11 Updated 10 months ago
intel / compute-aggregation-layer
Compute Aggregation Layer for oneAPI Level Zero and OpenCL(TM) Applications
☆17 Updated 2 weeks ago
rbga / CUDA-Merge-and-Bitonic-Sort
Efficient implementations of Merge Sort and Bitonic Sort algorithms using CUDA for GPU parallel processing, resulting in accelerated sort…
☆16 Updated last year
xjdr-alt / mla_blog_translation
☆13 Updated last year
MagellaX / StreamAttn
☆22 Updated last month
PrimeIntellect-ai / pccl
PCCL (Prime Collective Communications Library) implements fault tolerant collective communications over IP
☆96 Updated last month
spectral-compute / scale-examples
☆59 Updated last year
isEmmanuelOlowe / llm-cost-estimator
Estimating hardware and cloud costs of LLMs and transformer projects
☆18 Updated 3 weeks ago
ROCm / aiter
AI Tensor Engine for ROCm
☆232 Updated this week
ROCm / hipTensor
AMD’s C++ library for accelerating tensor primitives
☆43 Updated this week
abhisheknair10 / llama3.cu
Lightweight Llama 3 8B Inference Engine in CUDA C
☆47 Updated 3 months ago
amd / fuzzyHSA
☆54 Updated last year
ROCm / rocm_bandwidth_test
Bandwidth test for ROCm
☆60 Updated this week
EmbeddedLLM / vllm
vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs
☆87 Updated this week
LaurieWired / BenchmarkCustomPTX
Custom PTX Instruction Benchmark
☆126 Updated 4 months ago
wozeparrot / tinyrwkv
tinygrad port of the RWKV large language model.
☆45 Updated 4 months ago
geohot / tt-tiny
tiny code to access tenstorrent blackhole
☆55 Updated last month
ProjectPhysX / PTXprofiler
A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis.
☆55 Updated 3 months ago