ReinForce-II / mmapeak
☆43Updated 5 months ago
Alternatives and similar repositories for mmapeak
Users who are interested in mmapeak are comparing it to the libraries listed below
- ☆152Updated 3 months ago
- NVIDIA Linux open GPU with P2P support☆54Updated this week
- ☆76Updated 8 months ago
- Prepare for DeepSeek R1 inference: Benchmark CPU, DRAM, SSD, iGPU, GPU, ... with efficient code.☆73Updated 7 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk☆162Updated this week
- GPU benchmark☆68Updated 7 months ago
- DFloat11: Lossless LLM Compression for Efficient GPU Inference☆541Updated last month
- TritonParse: A Compiler Tracer, Visualizer, and mini-Reproducer (WIP) for Triton Kernels☆150Updated last week
- Samples of good AI generated CUDA kernels☆90Updated 3 months ago
- Fast and memory-efficient exact attention☆189Updated this week
- Fast low-bit matmul kernels in Triton☆371Updated last week
- High-Performance SGEMM on CUDA devices☆101Updated 8 months ago
- NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process com…☆318Updated last week
- A simple Flash Attention v2 implementation with ROCm (RDNA3 GPU, roc wmma), mainly used for Stable Diffusion (ComfyUI) in Windows ZLUDA en…☆48Updated last year
- kernels, of the mega variety☆496Updated 3 months ago
- ☆97Updated last month
- PTX-Tutorial Written Purely By AIs (Deep Research of OpenAI and Claude 3.7)☆66Updated 6 months ago
- An efficient GPU support for LLM inference with x-bit quantization (e.g. FP6,FP5).☆265Updated 2 months ago
- ☆17Updated 9 months ago
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec…☆194Updated last week
- AI Tensor Engine for ROCm☆279Updated this week
- Inference RWKV v7 in pure C.☆38Updated last month
- LLM Inference on consumer devices☆124Updated 6 months ago
- A collection of tricks and tools to speed up transformer models☆180Updated last week
- CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning☆191Updated last month
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs"☆155Updated 11 months ago
- ☆57Updated 3 months ago
- AMD RAD's experimental RMA library for Triton.☆74Updated this week
- Inference of Mamba models in pure C☆192Updated last year
- Scalable and robust tree-based speculative decoding algorithm☆358Updated 7 months ago