Quantized LLM training in pure CUDA/C++.
☆241Updated this week
Alternatives and similar repositories for llmq
Users who are interested in llmq are comparing it to the libraries listed below.
- Write a fast kernel and run it on Discord. See how you compare against the best!☆74Feb 18, 2026Updated last week
- My submission for the GPUMODE/AMD fp8 mm challenge☆29Jun 4, 2025Updated 8 months ago
- Persistent dense gemm for Hopper in `CuTeDSL`☆15Aug 9, 2025Updated 6 months ago
- Ship correct and fast LLM kernels to PyTorch☆142Jan 14, 2026Updated last month
- ☆19Nov 5, 2025Updated 3 months ago
- Benchmark tests supporting the TiledCUDA library.☆18Nov 19, 2024Updated last year
- ☆32Jul 2, 2025Updated 7 months ago
- QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning☆166Nov 11, 2025Updated 3 months ago
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard!☆208Feb 18, 2026Updated last week
- A practical way of learning Swizzle☆37Feb 3, 2025Updated last year
- Framework to reduce autotune overhead to zero for well known deployments.☆96Sep 19, 2025Updated 5 months ago
- TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels☆195Updated this week
- Fast low-bit matmul kernels in Triton☆433Feb 1, 2026Updated 3 weeks ago
- KernelBench: Can LLMs Write GPU Kernels? - Benchmark + Toolkit with Torch -> CUDA (+ more DSLs)☆820Updated this week
- ☆104Sep 9, 2024Updated last year
- ☆118May 19, 2025Updated 9 months ago
- Experiments related to LLMs for Vietnamese (inspired by the Physics of LLMs series)☆11Oct 21, 2024Updated last year
- ☆11Jan 25, 2022Updated 4 years ago
- PyTorch routines for (Ker)nel (Mac)hines☆10Oct 10, 2025Updated 4 months ago
- ☆23Jul 11, 2025Updated 7 months ago
- The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.☆13Mar 30, 2024Updated last year
- ☆11Dec 22, 2024Updated last year
- BFloat16 Fused Adam Operator for PyTorch☆16Nov 16, 2024Updated last year
- SIMD quantization kernels☆93Sep 7, 2025Updated 5 months ago
- Hand-Rolled GPU communications library☆85Nov 25, 2025Updated 3 months ago
- An experimental communicating attention kernel based on DeepEP.☆35Jul 29, 2025Updated 7 months ago
- ☆79Dec 27, 2024Updated last year
- ☆12Jun 11, 2022Updated 3 years ago
- Using Fourier interpolation to merge large language models☆11Jan 6, 2026Updated last month
- ☆15Oct 30, 2025Updated 4 months ago
- Perplexity GPU Kernels☆564Nov 7, 2025Updated 3 months ago
- PyTorch bindings for CUTLASS grouped GEMM.☆143May 29, 2025Updated 9 months ago
- ☆156Updated this week
- D. E. Shaw Research Technical Reports☆13Jul 4, 2022Updated 3 years ago
- HALO: Hadamard-Assisted Low-Precision Optimization and Training method for finetuning LLMs. 🚀 The official implementation of https://arx…☆29Feb 17, 2025Updated last year
- This repository is for topologic and geometric data analysis.☆13Jul 2, 2022Updated 3 years ago
- Single-step image generation at 306 FPS. Drifting vs Diffusion head-to-head on CIFAR-10.☆36Feb 13, 2026Updated 2 weeks ago
- ☆14Nov 3, 2025Updated 3 months ago
- ☆130Aug 18, 2025Updated 6 months ago