ReinForce-II / mmapeak
☆27 · Updated 2 months ago
Alternatives and similar repositories for mmapeak
Users interested in mmapeak are comparing it to the repositories listed below.
- ☆130 · Updated 2 months ago
- High-Performance SGEMM on CUDA devices · ☆94 · Updated 4 months ago
- GPU benchmark · ☆63 · Updated 4 months ago
- ☆70 · Updated 5 months ago
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models" · ☆275 · Updated last year
- Prepare for DeepSeek R1 inference: benchmark CPU, DRAM, SSD, iGPU, GPU, ... with efficient code · ☆72 · Updated 4 months ago
- Fast and memory-efficient exact attention · ☆173 · Updated this week
- Python bindings for ggml · ☆141 · Updated 9 months ago
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models · ☆272 · Updated 2 weeks ago
- ☆17 · Updated 6 months ago
- ☆140 · Updated 6 months ago
- QuIP quantization · ☆52 · Updated last year
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5) · ☆251 · Updated 7 months ago
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" · ☆154 · Updated 7 months ago
- PyTorch half-precision GEMM library with fused optional bias and optional ReLU/GELU · ☆67 · Updated 6 months ago
- Fast low-bit matmul kernels in Triton · ☆311 · Updated this week
- A simple Flash Attention v2 implementation with ROCm (RDNA3 GPU, roc wmma), mainly used for Stable Diffusion (ComfyUI) in Windows ZLUDA en… · ☆43 · Updated 9 months ago
- Learning about CUDA by writing PTX code · ☆131 · Updated last year
- Docker image for NVIDIA GH200 machines, optimized for vLLM serving and HF Trainer finetuning · ☆42 · Updated 3 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk · ☆117 · Updated this week
- RWKV-7: Surpassing GPT · ☆88 · Updated 6 months ago
- A collection of tricks and tools to speed up transformer models · ☆167 · Updated this week
- Inference of Mamba models in pure C · ☆187 · Updated last year
- RWKV in nanoGPT style · ☆191 · Updated 11 months ago
- PB-LLM: Partially Binarized Large Language Models · ☆152 · Updated last year
- Code for data-aware compression of DeepSeek models · ☆31 · Updated last month
- Model Compression Toolbox for Large Language Models and Diffusion Models · ☆489 · Updated 2 months ago
- Samples of good AI-generated CUDA kernels · ☆65 · Updated last week
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters · ☆126 · Updated 6 months ago
- 1.58-bit LLaMa model · ☆81 · Updated last year