ReinForce-II / mmapeak: Links
☆48 · Updated last month
Alternatives and similar repositories for mmapeak
Users interested in mmapeak are comparing it to the libraries listed below.
- NVIDIA Linux open GPU with P2P support ☆103 · Updated last month
- ☆162 · Updated 6 months ago
- DFloat11 [NeurIPS '25]: Lossless Compression of LLMs and DiTs for Efficient GPU Inference ☆589 · Updated last month
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆228 · Updated this week
- Prepare for DeepSeek R1 inference: benchmark CPU, DRAM, SSD, iGPU, GPU, ... with efficient code ☆73 · Updated 11 months ago
- Fast and memory-efficient exact attention ☆207 · Updated this week
- A simple Flash Attention v2 implementation with ROCm (RDNA3 GPU, roc wmma), mainly used for stable diffusion (ComfyUI) in Windows ZLUDA en… ☆50 · Updated last year
- ☆79 · Updated last year
- GPU benchmark ☆73 · Updated 11 months ago
- Fast low-bit matmul kernels in Triton ☆418 · Updated 3 weeks ago
- CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning ☆277 · Updated 2 months ago
- AI Tensor Engine for ROCm ☆330 · Updated this week
- TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels ☆182 · Updated this week
- ☆114 · Updated last week
- BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment ☆741 · Updated 5 months ago
- An innovative library for efficient LLM inference via low-bit quantization ☆351 · Updated last year
- kernels, of the mega variety ☆640 · Updated 3 months ago
- High-Performance SGEMM on CUDA devices ☆115 · Updated 11 months ago
- ☆270 · Updated last week
- 👷 Build compute kernels ☆198 · Updated 2 weeks ago
- VPTQ: a flexible and extreme low-bit quantization algorithm ☆671 · Updated 8 months ago
- ☆219 · Updated 11 months ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5) ☆276 · Updated 5 months ago
- LLM inference on consumer devices ☆128 · Updated 9 months ago
- ☆18 · Updated last year
- ☆83 · Updated last month
- Ship correct and fast LLM kernels to PyTorch ☆130 · Updated this week
- ☆69 · Updated 6 months ago
- Samples of good AI-generated CUDA kernels ☆99 · Updated 7 months ago
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs ☆94 · Updated this week