ReinForce-II / mmapeak
☆31 · Updated 3 months ago
Alternatives and similar repositories for mmapeak
Users that are interested in mmapeak are comparing it to the libraries listed below
- ☆139 · Updated 3 weeks ago
- NVIDIA Linux open GPU with P2P support ☆25 · Updated last month
- GPU benchmark ☆63 · Updated 5 months ago
- Fast and memory-efficient exact attention ☆177 · Updated this week
- ☆71 · Updated 6 months ago
- DFloat11: Lossless LLM Compression for Efficient GPU Inference ☆446 · Updated last month
- Prepare for DeepSeek R1 inference: benchmark CPU, DRAM, SSD, iGPU, GPU, ... with efficient code. ☆72 · Updated 5 months ago
- llama.cpp to PyTorch Converter ☆33 · Updated last year
- A simple Flash Attention v2 implementation with ROCm (RDNA3 GPU, roc wmma), mainly used for Stable Diffusion (ComfyUI) in Windows ZLUDA en… ☆43 · Updated 10 months ago
- Automatically quantize GGUF models ☆187 · Updated this week
- ☆71 · Updated 2 weeks ago
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models" ☆277 · Updated last year
- Samples of good AI-generated CUDA kernels ☆84 · Updated last month
- High-performance SGEMM on CUDA devices ☆97 · Updated 5 months ago
- AMD-related optimizations for transformer models ☆80 · Updated 3 weeks ago
- LLM inference on consumer devices ☆121 · Updated 4 months ago
- ☆17 · Updated 7 months ago
- Docker image for NVIDIA GH200 machines, optimized for vLLM serving and HF Trainer fine-tuning ☆46 · Updated 4 months ago
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" ☆153 · Updated 9 months ago
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆160 · Updated this week
- LLM training in simple, raw C/CUDA ☆99 · Updated last year
- An innovative library for efficient LLM inference via low-bit quantization ☆349 · Updated 10 months ago
- AI Tensor Engine for ROCm ☆232 · Updated this week
- Python bindings for ggml ☆142 · Updated 10 months ago
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆277 · Updated last month
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆135 · Updated this week
- Ring-attention experiments ☆143 · Updated 9 months ago
- Development repository for the Triton language and compiler ☆125 · Updated this week
- A collection of tricks and tools to speed up transformer models ☆170 · Updated last month
- Fast low-bit matmul kernels in Triton ☆330 · Updated last week