ROCm / TransformerEngine
☆12, updated this week
Related projects
Alternatives and complementary repositories for TransformerEngine
- Fast and memory-efficient exact attention (☆28, updated 2 weeks ago); a plain-PyTorch sketch of the attention computation appears after this list.
- PyTorch bindings for CUTLASS grouped GEMM (☆53, updated last week); a reference implementation of grouped-GEMM semantics appears after this list.
- Quantized Attention on GPU (☆29, updated last week)
- QQQ, a hardware-optimized W4A8 quantization scheme for LLMs (☆76, updated last month); a sketch of the W4A8 arithmetic appears after this list.
- Efficient GPU support for LLM inference with x-bit quantization such as FP6 and FP5 (☆196, updated 2 weeks ago)
- GPTQ inference TVM kernel (☆35, updated 6 months ago)
- High-speed GEMV kernels, up to 2.7x speedup over the PyTorch baseline (☆87, updated 4 months ago)
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs (☆184, updated last month)
- An experimental CPU backend for Triton (https://github.com/openai/triton; ☆35, updated 6 months ago)
- [WIP] Context parallel attention that works with torch.compile (☆20, updated last week)
- Patch convolution to avoid the large GPU memory usage of Conv2D (☆79, updated 5 months ago); a tiled-convolution sketch appears after this list.
- Tritonbench, a collection of PyTorch custom operators with example inputs to measure their performance (☆16, updated this week)
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer (☆85, updated 8 months ago)
- Official PyTorch implementation of FlatQuant: Flatness Matters for LLM Quantization (☆59, updated this week)
- Applied AI experiments and examples for PyTorch (☆160, updated last week)
- Simple and fast low-bit matmul kernels in CUDA / Triton (☆140, updated this week)
- Boosting 4-bit inference kernels with 2:4 Sparsity (☆51, updated 2 months ago)
- Extensible collectives library in Triton (☆65, updated last month)
- A Python library that transfers PyTorch tensors between CPU and NVMe (☆96, updated this week); a disk-offload sketch appears after this list.
- Standalone Flash Attention v2 kernel without libtorch dependency (☆98, updated 2 months ago)
- Framework to reduce autotune overhead to zero for well-known deployments (☆19, updated 3 weeks ago)
- vLLM performance dashboard (☆18, updated 6 months ago)
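
The sketches below expand on a few of the entries above. All are illustrative, plain-PyTorch references and do not use the linked repositories' APIs.

First, the attention entry: a minimal sketch of the exact attention computation that FlashAttention-style kernels accelerate. The naive version materializes the full (seq × seq) score matrix, which is exactly what memory-efficient kernels avoid; torch.nn.functional.scaled_dot_product_attention dispatches to a fused kernel when one is available.

```python
import math
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])  # (b, h, seq, seq)
    return torch.softmax(scores, dim=-1) @ v                   # (b, h, seq, head_dim)

q, k, v = (torch.randn(2, 4, 128, 64) for _ in range(3))
ref = naive_attention(q, k, v)
fused = F.scaled_dot_product_attention(q, k, v)  # may run a fused/flash kernel
print(torch.allclose(ref, fused, atol=1e-5))     # True, up to numerical tolerance
```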
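Grouped GEMM, as referenced in the CUTLASS-bindings entry, computes several independent matrix products whose shapes may differ from group to group in a single fused launch. The loop below is only a correctness reference for those semantics, not the bindings' API.

```python
import torch

def grouped_gemm_reference(a_list, b_list):
    # One matmul per group; a real grouped-GEMM kernel fuses these into one launch.
    return [a @ b for a, b in zip(a_list, b_list)]

a_list = [torch.randn(m, 64) for m in (8, 16, 32)]
b_list = [torch.randn(64, n) for n in (128, 32, 256)]
outs = grouped_gemm_reference(a_list, b_list)
print([tuple(o.shape) for o in outs])  # [(8, 128), (16, 32), (32, 256)]
```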
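A hedged sketch of the arithmetic behind W4A8 schemes such as QQQ: weights are quantized to 4-bit integers with a per-output-channel scale, activations to 8-bit integers with a per-tensor scale. Real kernels keep the data packed and multiply in low precision; this reference dequantizes first and only shows the quantization math, not the repository's implementation.

```python
import torch

def quantize_symmetric(x, bits, dim=None):
    # Symmetric quantization: q = round(x / scale), with scale chosen so the
    # largest magnitude maps to the largest positive quantized value.
    qmax = 2 ** (bits - 1) - 1
    amax = x.abs().max() if dim is None else x.abs().amax(dim=dim, keepdim=True)
    scale = amax.clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q, scale

w = torch.randn(256, 512)                             # (out_features, in_features)
x = torch.randn(8, 512)                               # activations
w_q, w_scale = quantize_symmetric(w, bits=4, dim=1)   # per-output-channel scales
x_q, x_scale = quantize_symmetric(x, bits=8)          # per-tensor scale

y_ref = x @ w.t()
y_w4a8 = (x_q * x_scale) @ (w_q * w_scale).t()        # dequantize-then-matmul reference
print((y_ref - y_w4a8).abs().mean().item())           # mean quantization error
```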
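The patch-convolution idea, sketched under the assumption of a stride-1, 3x3 convolution: pad once, then convolve horizontal bands with a one-pixel halo, so intermediate activations never cover the whole image at once. The linked repository patches Conv2D transparently; this only shows the underlying tiling trick.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Conv2d(3, 16, kernel_size=3, bias=False)  # padding handled manually below
x = torch.randn(1, 3, 64, 64)
x_pad = F.pad(x, (1, 1, 1, 1))                      # pad H and W by 1 on each side

ref = conv(x_pad)                                   # convolve the whole image at once

tile = 16
bands = []
for s in range(0, 64, tile):
    # Output rows s..s+tile-1 depend on padded input rows s..s+tile+1 (1-px halo).
    bands.append(conv(x_pad[:, :, s:s + tile + 2, :]))
tiled = torch.cat(bands, dim=2)

print(torch.allclose(ref, tiled, atol=1e-6))        # True: the tiling is exact
```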
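Finally, a loose sketch of the tensor-offload idea behind the CPU-to-NVMe library: park tensors that do not fit in device or host memory on fast storage and reload them on demand. torch.save/torch.load stand in for the library's pinned-buffer, asynchronous I/O path; the class and method names here are illustrative, not the library's API.

```python
import os
import tempfile
import torch

class DiskOffloadedTensor:
    """Hypothetical helper: park a tensor on disk, reload it when needed."""

    def __init__(self, tensor, directory):
        self.path = os.path.join(directory, f"tensor_{id(tensor)}.pt")
        torch.save(tensor.cpu(), self.path)          # stage to NVMe-backed storage

    def load(self, device="cpu"):
        return torch.load(self.path, map_location=device)

with tempfile.TemporaryDirectory() as tmp:
    t = torch.randn(1024, 1024)
    offloaded = DiskOffloadedTensor(t, tmp)          # t can now be freed
    restored = offloaded.load()
    print(torch.equal(t, restored))                  # True
```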