ReinForce-II / mmapeak
☆27 · Updated 2 months ago
Alternatives and similar repositories for mmapeak
Users interested in mmapeak are comparing it to the repositories listed below.
- ☆130 · Updated 2 months ago
- High-Performance SGEMM on CUDA devices · ☆94 · Updated 4 months ago
- GPU benchmark · ☆63 · Updated 4 months ago
- ☆70 · Updated 5 months ago
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models" · ☆275 · Updated last year
- Prepare for DeepSeek R1 inference: benchmark CPU, DRAM, SSD, iGPU, GPU, ... with efficient code · ☆72 · Updated 4 months ago
- Fast and memory-efficient exact attention · ☆173 · Updated this week
- Python bindings for ggml · ☆141 · Updated 9 months ago
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models · ☆272 · Updated 2 weeks ago
- ☆17 · Updated 6 months ago
- ☆140 · Updated 6 months ago
- QuIP quantization · ☆52 · Updated last year
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5) · ☆251 · Updated 7 months ago
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" · ☆154 · Updated 7 months ago
- PyTorch half-precision GEMM library with fused optional bias and optional ReLU/GELU · ☆67 · Updated 6 months ago
- Fast low-bit matmul kernels in Triton · ☆311 · Updated this week
- A simple Flash Attention v2 implementation with ROCm (RDNA3 GPU, roc wmma), mainly used for Stable Diffusion (ComfyUI) in Windows ZLUDA en… · ☆43 · Updated 9 months ago
- Learning about CUDA by writing PTX code · ☆131 · Updated last year
- Docker image for NVIDIA GH200 machines, optimized for vLLM serving and HF Trainer finetuning · ☆42 · Updated 3 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk · ☆117 · Updated this week
- RWKV-7: Surpassing GPT · ☆88 · Updated 6 months ago
- A collection of tricks and tools to speed up transformer models · ☆167 · Updated this week
- Inference of Mamba models in pure C · ☆187 · Updated last year
- RWKV in nanoGPT style · ☆191 · Updated 11 months ago
- PB-LLM: Partially Binarized Large Language Models · ☆152 · Updated last year
- Code for data-aware compression of DeepSeek models · ☆31 · Updated last month
- Model Compression Toolbox for Large Language Models and Diffusion Models · ☆489 · Updated 2 months ago
- Samples of good AI-generated CUDA kernels · ☆65 · Updated last week
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters · ☆126 · Updated 6 months ago
- 1.58-bit LLaMa model · ☆81 · Updated last year