KONAKONA666 / q8_kernels
☆59 · Updated last month
Alternatives and similar repositories for q8_kernels:
Users interested in q8_kernels are comparing it to the libraries listed below.
- (WIP) Parallel inference for black-forest-labs' FLUX model. ☆17 · Updated 3 months ago
- PyTorch half-precision GEMM library with fused optional bias and optional ReLU/GELU. ☆53 · Updated 2 months ago
- A parallelized VAE that avoids OOM in high-resolution image generation. ☆53 · Updated 3 weeks ago
- Context-parallel attention that accelerates DiT model inference with dynamic caching. ☆189 · Updated this week
- ☆107 · Updated last month
- A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters. ☆39 · Updated 6 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆89 · Updated this week
- Patch convolution to avoid large GPU memory usage of Conv2D. ☆85 · Updated 3 weeks ago
- Writing FLUX in Triton. ☆32 · Updated 5 months ago
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs. ☆229 · Updated this week
- ☆61 · Updated 3 weeks ago
- Extensible collectives library in Triton. ☆83 · Updated 4 months ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5). ☆234 · Updated 3 months ago
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding. ☆107 · Updated 2 months ago
- High-speed GEMV kernels with up to 2.7x speedup over the PyTorch baseline. ☆97 · Updated 7 months ago
- An auxiliary project analyzing the characteristics of KV in DiT attention. ☆25 · Updated 2 months ago
- Ring-attention experiments. ☆123 · Updated 4 months ago
- Model Compression Toolbox for Large Language Models and Diffusion Models. ☆330 · Updated this week
- ☆67 · Updated 2 months ago
- This repository contains the experimental PyTorch native float8 training UX. ☆221 · Updated 6 months ago
- Fast low-bit matmul kernels in Triton. ☆238 · Updated this week
- ☆46 · Updated last year
- QuIP quantization. ☆50 · Updated 11 months ago
- 📖 A curated list of Awesome Diffusion Inference Papers with codes: Sampling, Caching, Multi-GPUs, etc. 🎉🎉 ☆189 · Updated last month
- Boosting 4-bit inference kernels with 2:4 Sparsity. ☆64 · Updated 5 months ago
- ☆157 · Updated last year
- ☆180 · Updated this week
- Focused on fast experimentation and simplicity. ☆66 · Updated last month