tugrul512bit / TurtleSort
Multi-heap-sort for many small arrays, quicksort with 3 pivots for one big array, CUDA acceleration, CUDA memory compression.
☆11 · Updated 7 months ago
Alternatives and similar repositories for TurtleSort:
Users who are interested in TurtleSort are comparing it to the libraries listed below.
- LLM training in simple, raw C/CUDA ☆94 · Updated last year
- High-Performance SGEMM on CUDA devices ☆90 · Updated 3 months ago
- Lightweight Llama 3 8B Inference Engine in CUDA C ☆47 · Updated last month
- Schola is a plugin for enabling Reinforcement Learning (RL) in Unreal Engine. It provides tools to help developers create environments, d… ☆35 · Updated last month
- 👷 Build compute kernels ☆37 · Updated last week
- A minimalistic C++ Jinja templating engine for LLM chat templates ☆137 · Updated this week
- The Farm-SVE package provides a header that implements the ARM C language extensions (ACLE) for the ARM Scalable Vector Extension (SVE) i… ☆14 · Updated last year
- Distributed ranges is a generalization of C++ ranges for distributed data structures. ☆50 · Updated last week
- JAX bindings for the flash-attention3 kernels ☆11 · Updated 9 months ago
- Numbast is a tool to build an automated pipeline that converts CUDA APIs into Numba bindings. ☆44 · Updated this week
- A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis. ☆50 · Updated last month
- NVIDIA HPCG is based on the HPCG benchmark and optimized for performance on NVIDIA accelerated HPC systems. ☆54 · Updated 2 weeks ago
- tenstorrent kernel from twitch ☆27 · Updated last year
- A list of awesome resources and blogs on topics related to Unum ☆40 · Updated 6 months ago
- If only std::set was a DBMS: collection of templated ACID in-memory exception-free thread-safe and concurrent containers in a header-only… ☆40 · Updated 2 years ago
- Fast and vectorizable algorithms for searching in a vector of sorted floating point numbers ☆137 · Updated 4 months ago
- Rust crates for XetHub ☆43 · Updated 6 months ago
- ☆13 · Updated 10 months ago
- asynchronous/distributed speculative evaluation for llama3 ☆39 · Updated 9 months ago
- C99-compatible library for efficiently parking threads on all major operating systems ☆11 · Updated last month
- Custom PTX Instruction Benchmark ☆123 · Updated 2 months ago
- Evaluating Large Language Models for CUDA Code Generation. ComputeEval is a framework designed to generate and evaluate CUDA code from Lar… ☆38 · Updated 2 weeks ago
- C++20 idiomatic APIs for the Apache Arrow Columnar Format ☆86 · Updated this week
- Intel® SHMEM - Device initiated shared memory based communication library ☆23 · Updated last month
- pytorch from scratch in pure C/CUDA and python ☆40 · Updated 7 months ago
- LLM inference in Fortran ☆58 · Updated 11 months ago
- Guides and examples to help achieve optimal performance on an NVIDIA Grace CPU ☆13 · Updated 9 months ago
- ☆31 · Updated last week
- Open deep learning compiler stack for cpu, gpu and specialized accelerators ☆18 · Updated this week
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs ☆86 · Updated this week