aikitoria / open-gpu-kernel-modules
NVIDIA Linux open GPU kernel modules with P2P support
☆17 · Updated last month
Alternatives and similar repositories for open-gpu-kernel-modules:
Users interested in open-gpu-kernel-modules are comparing it to the libraries listed below.
- ☆126 · Updated last month
- High-Performance SGEMM on CUDA devices ☆90 · Updated 3 months ago
- llama.cpp fork with additional SOTA quants and improved performance ☆292 · Updated this week
- A simple Flash Attention v2 implementation with ROCm (RDNA3 GPU, rocWMMA), mainly used for Stable Diffusion (ComfyUI) on Windows ZLUDA en… ☆41 · Updated 8 months ago
- Prepare for DeepSeek R1 inference: benchmark CPU, DRAM, SSD, iGPU, GPU, ... with efficient code ☆71 · Updated 2 months ago
- LLM inference on consumer devices ☆105 · Updated last month
- Fast low-bit matmul kernels in Triton ☆291 · Updated this week
- ☆66 · Updated 3 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆100 · Updated last week
- Boosting 4-bit inference kernels with 2:4 sparsity ☆72 · Updated 7 months ago
- Development repository for the Triton language and compiler ☆118 · Updated this week
- GPU benchmark ☆59 · Updated 2 months ago
- AI Tensor Engine for ROCm ☆180 · Updated this week
- Fast and memory-efficient exact attention ☆171 · Updated this week
- KV cache compression for high-throughput LLM inference ☆126 · Updated 2 months ago
- Perplexity GPU Kernels ☆235 · Updated 2 weeks ago
- ☆54 · Updated 10 months ago
- Linux-based GDDR6/GDDR6X VRAM temperature reader for NVIDIA RTX 3000/4000 series GPUs ☆99 · Updated this week
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆263 · Updated 6 months ago
- ☆78 · Updated 5 months ago
- Fast Hadamard transform in CUDA, with a PyTorch interface ☆174 · Updated 11 months ago
- Code for the paper "QuIP: 2-Bit Quantization of Large Language Models With Guarantees", adapted for Llama models ☆35 · Updated last year
- Core, junction, and VRAM temperature reader for Linux + GDDR6/GDDR6X GPUs ☆39 · Updated 4 months ago
- Fast matrix multiplications for lookup-table-quantized LLMs ☆358 · Updated last week
- 8-bit CUDA functions for PyTorch ☆48 · Updated 2 months ago
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients ☆197 · Updated 9 months ago
- PTX tutorial written purely by AIs (OpenAI Deep Research and Claude 3.7) ☆65 · Updated last month
- A toolkit for fine-tuning, inference, and evaluation of GreenBitAI's LLMs ☆82 · Updated last month
- QQQ is a hardware-optimized W4A8 quantization solution for LLMs ☆116 · Updated 2 weeks ago
- TritonBench is a collection of PyTorch custom operators with example inputs to measure their performance ☆111 · Updated this week