tugrul512bit / TurtleSort
Multi-heap-sort for many small arrays, quicksort with 3 pivots for one big array, CUDA acceleration, CUDA memory compression.
☆12Updated last year
Alternatives and similar repositories for TurtleSort
Users who are interested in TurtleSort are comparing it to the libraries listed below.
- LLM training in simple, raw C/CUDA☆112Updated last year
- High-Performance FP32 GEMM on CUDA devices☆117Updated last year
- We aim to redefine Data Parallel libraries portability, performance, programmability and maintainability, by using C++ standard features, i…☆47Updated this week
- Fast and Furious AMD Kernels☆348Updated 2 weeks ago
- PTX-Tutorial Written Purely By AIs (Deep Research of OpenAI and Claude 3.7)☆66Updated 10 months ago
- ☆15Updated 3 months ago
- ☆53Updated 9 months ago
- Hand-Rolled GPU communications library☆81Updated 2 months ago
- Evaluating Large Language Models for CUDA Code Generation ComputeEval is a framework designed to generate and evaluate CUDA code from Lar…☆96Updated last month
- Tilus is a tile-level kernel programming language with explicit control over shared memory and registers.☆440Updated last month
- Ship correct and fast LLM kernels to PyTorch☆140Updated 3 weeks ago
- Experimental GPU language with meta-programming☆24Updated last year
- An experimental CPU backend for Triton (https://github.com/openai/triton)☆49Updated 5 months ago
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs☆93Updated this week
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard!☆200Updated last week
- ☆219Updated last year
- Super fast FP32 matrix multiplication on RDNA3☆82Updated 10 months ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.☆106Updated 7 months ago
- ☆38Updated last year
- PCCL (Prime Collective Communications Library) implements fault tolerant collective communications over IP☆141Updated 4 months ago
- CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning☆417Updated last month
- ☆32Updated 7 months ago
- extensible collectives library in triton☆95Updated 10 months ago
- Write a fast kernel and run it on Discord. See how you compare against the best!☆68Updated this week
- NVIDIA HPCG is based on the HPCG benchmark and optimized for performance on NVIDIA accelerated HPC systems.☆67Updated 3 weeks ago
- Efficient implementation of DeepSeek Ops (Blockwise FP8 GEMM, MoE, and MLA) for AMD Instinct MI300X☆75Updated last week
- Efficient Long-context Language Model Training by Core Attention Disaggregation☆87Updated last week
- Helpful kernel tutorials and examples for tile-based GPU programming☆630Updated this week
- NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process com…☆461Updated last month
- My submission for the GPUMODE/AMD fp8 mm challenge☆29Updated 8 months ago