aikitoria / open-gpu-kernel-modules
NVIDIA Linux open GPU with P2P support
☆15 · Updated last week
Alternatives and similar repositories for open-gpu-kernel-modules:
- AI Tensor Engine for ROCm ☆142 · Updated this week
- A simple Flash Attention v2 implementation with ROCm (RDNA3 GPU, roc wmma), mainly used for stable diffusion (ComfyUI) in Windows ZLUDA en… ☆37 · Updated 7 months ago
- llama.cpp fork with additional SOTA quants and improved performance ☆222 · Updated this week
- Development repository for the Triton language and compiler ☆114 · Updated this week
- GPU benchmark ☆57 · Updated 2 months ago
- ☆112 · Updated this week
- High-Performance SGEMM on CUDA devices ☆87 · Updated 2 months ago
- ☆54 · Updated 9 months ago
- LLM training in simple, raw C/HIP for AMD GPUs ☆44 · Updated 6 months ago
- Fast low-bit matmul kernels in Triton ☆272 · Updated this week
- Fast and memory-efficient exact attention ☆163 · Updated this week
- GPU Power and Performance Manager ☆57 · Updated 5 months ago
- Bamboo-7B Large Language Model ☆92 · Updated last year
- AMD-related optimizations for transformer models ☆71 · Updated 4 months ago
- Prepare for DeepSeek R1 inference: benchmark CPU, DRAM, SSD, iGPU, GPU, ... with efficient code. ☆70 · Updated last month
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs ☆87 · Updated this week
- LLM inference in C/C++ ☆67 · Updated this week
- LLM inference on consumer devices ☆102 · Updated 2 weeks ago
- ☆40 · Updated last year
- Tcurtsni: Reverse Instruction Chat — ever wonder what your LLM wants to ask you? ☆21 · Updated 9 months ago
- An experimental CPU backend for Triton (https://github.com/openai/triton) ☆40 · Updated 2 weeks ago
- High-speed GEMV kernels, up to 2.7x speedup over the PyTorch baseline. ☆102 · Updated 8 months ago
- ☆65 · Updated 3 months ago
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" ☆155 · Updated 5 months ago
- Fast matrix multiplications for lookup-table-quantized LLMs ☆235 · Updated last month
- PTX tutorial written purely by AIs (OpenAI Deep Research and Claude 3.7) ☆60 · Updated last week
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆91 · Updated this week
- Unofficial description of the CUDA assembly (SASS) instruction sets ☆76 · Updated 3 weeks ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆262 · Updated 5 months ago
- Ahead-of-Time (AOT) Triton math library ☆56 · Updated last week