fabiocannizzo / FastBinarySearch
Fast and vectorizable algorithms for searching in a vector of sorted floating point numbers
☆153 · Updated last year
Alternatives and similar repositories for FastBinarySearch
Users interested in FastBinarySearch are comparing it to the libraries listed below.
- LLM training in simple, raw C/CUDA ☆112 · Updated last year
- High-Performance FP32 GEMM on CUDA devices ☆117 · Updated last year
- Implementation of the paper "Lossless Compression of Vector IDs for Approximate Nearest Neighbor Search" by Severo et al. ☆89 · Updated 2 weeks ago
- Inference of Mamba and Mamba2 models in pure C ☆196 · Updated this week
- A tracing JIT compiler for PyTorch ☆13 · Updated 4 years ago
- A stand-alone implementation of several NumPy dtype extensions used in machine learning. ☆327 · Updated 3 weeks ago
- Simple high-throughput inference library ☆155 · Updated 8 months ago
- Implementation of "Efficient Multi-vector Dense Retrieval with Bit Vectors", ECIR 2024 ☆68 · Updated 3 months ago
- 🏙 Interactive performance profiling and debugging tool for PyTorch neural networks. ☆64 · Updated last year
- A minimalistic C++ Jinja templating engine for LLM chat templates ☆202 · Updated 4 months ago
- Make Triton easier ☆50 · Updated last year
- ☆71 · Updated 10 months ago
- Extensible collectives library in Triton ☆93 · Updated 9 months ago
- Clover: Quantized 4-bit Linear Algebra Library ☆114 · Updated 7 years ago
- Nod.ai 🦈 version of 👻. You probably want to start at https://github.com/nod-ai/shark for the product and the upstream IREE repository… ☆107 · Updated last month
- Standalone command-line tool for compiling Triton kernels ☆20 · Updated last year
- A Python library that transfers PyTorch tensors between CPU and NVMe ☆125 · Updated last year
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆234 · Updated last week
- ☆24 · Updated last year
- CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning ☆356 · Updated 2 weeks ago
- Samples of good AI-generated CUDA kernels ☆99 · Updated 7 months ago
- torch::deploy (multipy for non-torch uses) is a system that lets you get around the GIL problem by running multiple Python interpreters i… ☆182 · Updated last month
- Benchmarks to capture important workloads. ☆32 · Updated 2 weeks ago
- GPU benchmark ☆73 · Updated 11 months ago
- ☆27 · Updated 2 years ago
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆68 · Updated this week
- Prepare for DeepSeek R1 inference: Benchmark CPU, DRAM, SSD, iGPU, GPU, ... with efficient code. ☆74 · Updated 11 months ago
- An experimental CPU backend for Triton (https://github.com/openai/triton) ☆48 · Updated 5 months ago
- FlexAttention w/ FlashAttention3 Support ☆27 · Updated last year
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆131 · Updated last year