aikitoria / open-gpu-kernel-modulesLinks

NVIDIA Linux open GPU with P2P support

☆83

Alternatives and similar repositories for open-gpu-kernel-modules

Users that are interested in open-gpu-kernel-modules are comparing it to the libraries listed below

Sorting:

LeanModels / DFloat11
DFloat11: Lossless LLM Compression for Efficient GPU Inference
☆562Updated this week
CerebrasResearch / reap
REAP: Router-weighted Expert Activation Pruning for SMoE compression
☆119Updated 3 weeks ago
Cornell-RelaxML / qtip
☆156Updated 5 months ago
ReinForce-II / mmapeak
☆46Updated last month
ScalingIntelligence / good-kernels
Samples of good AI generated CUDA kernels
☆92Updated 6 months ago
Infini-AI-Lab / UMbreLLa
LLM Inference on consumer devices
☆125Updated 8 months ago
NimbleEdge / sparse_transformers
Sparse Inferencing for transformer based LLMs
☆213Updated 3 months ago
SJTU-IPADS / Bamboo
Bamboo-7B Large Language Model
☆92Updated last year
chu-tianxiang / QuIP-for-all
QuIP quantization
☆61Updated last year
ROCm / flash-attention
Fast and memory-efficient exact attention
☆201Updated last month
vllm-project / compressed-tensors
A safetensors extension to efficiently store sparse quantized tensors on disk
☆210Updated last week
inferx-net / inferx
InferX: Inference as a Service Platform
☆139Updated this week
Cornell-RelaxML / yaqa-quantization
☆63Updated 5 months ago
jukofyork / transplant-vocab
Transplants vocabulary between language models, enabling the creation of draft models for speculative decoding WITHOUT retraining.
☆46Updated last month
tdrussell / qlora-pipe
A pipeline parallel training script for LLMs.
☆163Updated 7 months ago
leafspark / AutoGGUF
automatically quant GGUF models
☆217Updated last month
Repeerc / flash-attention-v2-RDNA3-minimal
a simple Flash Attention v2 implementation with ROCM (RDNA3 GPU, roc wmma), mainly used for stable diffusion(ComfyUI) in Windows ZLUDA en…
☆48Updated last year
BlinkDL / fast.c
Prepare for DeekSeek R1 inference: Benchmark CPU, DRAM, SSD, iGPU, GPU, ... with efficient code.
☆73Updated 9 months ago
turboderp-org / exllamav3
An optimized quantization and inference library for running LLMs locally on modern consumer-class GPUs
☆586Updated this week
mag- / gpu_benchmark
Gpu benchmark
☆73Updated 10 months ago
matt-c1 / llama-3-quant-comparison
Comparison of the output quality of quantization methods, using Llama 3, transformers, GGUF, EXL2.
☆165Updated last year
WaveSpeedAI / QuantumAttention
[WIP] Better (FP8) attention for Hopper
☆32Updated 9 months ago
wdlctc / headinfer
☆60Updated 6 months ago
powerserve-project / PowerServe
High-speed and easy-use LLM serving framework for local deployment
☆137Updated 3 months ago
chu-tianxiang / llama-cpp-torch
llama.cpp to PyTorch Converter
☆34Updated last year
bjj / exllamav2-openai-server
An OpenAI API compatible LLM inference server based on ExLlamaV2.
☆25Updated last year
KONAKONA666 / q8_kernels
☆77Updated 11 months ago
IsaacRe / vllm-kvcompress
KV cache compression for high-throughput LLM inference
☆143Updated 9 months ago
Thireus / GGUF-Tool-Suite
Input your VRAM and RAM and the toolchain will produce a GGUF model tuned to your system within seconds — flexible model sizing and lowes…
☆65Updated last week
chigkim / Ollama-MMLU-Pro
☆107Updated 3 months ago