☆27 · Jan 8, 2024 · Updated 2 years ago
Alternatives and similar repositories for gpu_kernels
Users interested in gpu_kernels are comparing it to the libraries listed below.
- GPTQ inference Triton kernel (☆321 · May 18, 2023 · updated 2 years ago)
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (☆11 · Dec 13, 2023 · updated 2 years ago)
- Sirius, an efficient correction mechanism that significantly boosts Contextual Sparsity models on reasoning tasks while maintaining its… (☆21 · Sep 10, 2024 · updated last year)
- Codes & examples for "CUDA - From Correctness to Performance" (☆123 · Oct 24, 2024 · updated last year)
- ☆26 · Feb 17, 2025 · updated last year
- Flash attention tutorial written in Python, Triton, CUDA, CUTLASS (☆491 · Jan 20, 2026 · updated 2 months ago)
- Möbius Transformation for Fast Inner Product Search on Graph (☆22 · Jun 3, 2021 · updated 4 years ago)
- Inference Llama 2 in one file of pure CUDA (☆16 · Aug 20, 2023 · updated 2 years ago)
- Llama INT4 CUDA inference with AWQ (☆53 · Jan 20, 2025 · updated last year)
- Follow the nginx log and find the bad guys! (☆23 · Mar 7, 2026 · updated last week)
- Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models" (☆323 · Mar 4, 2025 · updated last year)
- ☆150 · Jan 9, 2025 · updated last year
- LLVM-Canon aims to transform LLVM modules into a canonical form by reordering and renaming instructions while preserving the same semanti… (☆32 · Apr 30, 2024 · updated last year)
- LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation (☆33 · Feb 26, 2026 · updated 3 weeks ago)
- Deep learning framework using C++17 in a single header file (☆31 · Sep 13, 2020 · updated 5 years ago)
- An easy-to-understand TensorOp Matmul Tutorial (☆409 · Mar 5, 2026 · updated 2 weeks ago)
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens (☆1,041 · Sep 4, 2024 · updated last year)
- vLLM plugin for RBLN NPU (☆43 · Mar 13, 2026 · updated last week)
- Benchmark suite for LLMs from Fireworks.ai (☆95 · Mar 11, 2026 · updated last week)
- ☆14 · Nov 28, 2023 · updated 2 years ago
- Latency and Memory Analysis of Transformer Models for Training and Inference (☆480 · Apr 19, 2025 · updated 11 months ago)
- Triton-based implementation of Sparse Mixture of Experts (☆270 · Oct 3, 2025 · updated 5 months ago)
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (☆1,621 · Jul 12, 2024 · updated last year)
- ☆56 · Nov 14, 2024 · updated last year
- ☆11 · Sep 4, 2022 · updated 3 years ago
- A minimal cache manager for PagedAttention, on top of llama3 (☆138 · Aug 26, 2024 · updated last year)
- Parallel Self-Adjusting Computation (☆16 · Jul 5, 2021 · updated 4 years ago)
- Demo of fine-tuning QA models for answering FAQs from cloud provider documentation (☆11 · Mar 7, 2023 · updated 3 years ago)
- ☆12 · Mar 13, 2023 · updated 3 years ago
- Transformers components, but in Triton (☆34 · May 9, 2025 · updated 10 months ago)
- LLM Inference with Microscaling Format (☆34 · Nov 12, 2024 · updated last year)
- [WIP] Better (FP8) attention for Hopper (☆32 · Feb 24, 2025 · updated last year)
- Applied AI experiments and examples for PyTorch (☆319 · Aug 22, 2025 · updated 6 months ago)
- Sample examples of how to call collective operation functions in multi-GPU environments. A simple example of using broadcast, reduce, all… (☆35 · Aug 28, 2023 · updated 2 years ago)
- SparseTIR: Sparse Tensor Compiler for Deep Learning (☆144 · Mar 31, 2023 · updated 2 years ago)
- CUDA 8-bit Tensor Core Matrix Multiplication based on the m16n16k16 WMMA API (☆34 · Sep 15, 2023 · updated 2 years ago)
- Source code for "BenchPress: A Deep Active Benchmark Generator", PACT 2022 (☆21 · Mar 15, 2023 · updated 3 years ago)
- A fast and customizable CUDA INT4 tensor core GEMM (☆15 · Aug 2, 2024 · updated last year)
- ☆13 · Oct 5, 2020 · updated 5 years ago
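Several entries in this list (GPTQ, SmoothQuant, AWQ, the FP16xINT4 inference kernel, the INT4 tensor-core GEMM) revolve around low-bit weight quantization. As a rough illustration of the shared core idea, here is a minimal sketch of symmetric per-row INT4 quantization and dequantization in plain Python; the function names and scaling scheme are illustrative assumptions, not code from any of the listed repositories:

```python
def quantize_int4(row):
    # Symmetric per-row quantization: map each value into the signed
    # INT4 range [-8, 7] using a single scale derived from the row max.
    scale = max(abs(x) for x in row) / 7.0
    q = [max(-8, min(7, round(x / scale))) for x in row]
    return q, scale

def dequantize_int4(q, scale):
    # Reconstruct an approximation of the original floating-point row.
    return [v * scale for v in q]

row = [0.12, -0.87, 0.45, 1.30, -0.02, 0.66, -1.10, 0.31]
q, scale = quantize_int4(row)
approx = dequantize_int4(q, scale)
max_err = max(abs(a - b) for a, b in zip(row, approx))
```

With symmetric rounding, the per-element reconstruction error is bounded by half the quantization step (`scale / 2`); real kernels like those above additionally pack two INT4 values per byte and fuse dequantization into the GEMM.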