feifeibear / ChituAttention
Quantized Attention on GPU
☆29 · Updated last week
Related projects
Alternatives and complementary repositories for ChituAttention
- GPTQ inference TVM kernel ☆35 · Updated 6 months ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆53 · Updated last week
- ☆18 · Updated last month
- A sparse attention kernel supporting mixed sparse patterns ☆53 · Updated 3 weeks ago
- ☆79 · Updated 2 months ago
- Decoding Attention is specially optimized for multi-head attention (MHA) using CUDA cores for the decoding stage of LLM inference. ☆23 · Updated last week
- Odysseus: Playground of LLM Sequence Parallelism ☆55 · Updated 4 months ago
- [WIP] Context parallel attention that works with torch.compile ☆20 · Updated this week
- FP8 flash attention implemented with the CUTLASS library on the Ada architecture ☆52 · Updated 3 months ago
- Official PyTorch implementation of FlatQuant: Flatness Matters for LLM Quantization ☆59 · Updated this week
- SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models ☆12 · Updated last month
- Puzzles for learning Triton; play them with minimal environment configuration! ☆61 · Updated this week
- ☆42 · Updated 7 months ago
- Debug print operator for cudagraph debugging ☆10 · Updated 3 months ago
- Standalone Flash Attention v2 kernel without libtorch dependency ☆98 · Updated 2 months ago
- 16-fold memory access reduction with nearly no loss ☆57 · Updated this week
- TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention ☆21 · Updated last month
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆51 · Updated 2 months ago
- ☆63 · Updated 3 months ago
- TensorRT LLM Benchmark Configuration ☆11 · Updated 3 months ago
- ☆42 · Updated 6 months ago
- Summary of systems papers/frameworks/code/tools for training or serving large models ☆56 · Updated 10 months ago
- ☆46 · Updated last month
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs. ☆76 · Updated last month
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer ☆85 · Updated 8 months ago
- PyTorch library for cost-effective, fast and easy serving of MoE models. ☆101 · Updated 2 months ago
- ☆42 · Updated 4 months ago
- A parallel VAE that avoids OOM for high-resolution image generation ☆40 · Updated last month
- Transformers components but in Triton ☆25 · Updated this week
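
Several of the kernels above revolve around quantized attention. Below is a minimal, purely illustrative PyTorch sketch of the core idea (symmetric per-tensor INT8 quantization of Q and K before the score matmul, dequantizing back to floating point for the softmax). It is not the API of ChituAttention or of any repository listed here; all function names are made up for this sketch.

```python
# Illustrative sketch only -- not the API of ChituAttention or any repo above.
import torch

def quantize_int8(x: torch.Tensor):
    """Symmetric per-tensor INT8 quantization; returns (int8 tensor, fp32 scale)."""
    scale = x.float().abs().amax().clamp(min=1e-8) / 127.0
    q = (x.float() / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def quantized_attention(q, k, v):
    """Attention whose Q*K^T scores are computed from INT8-quantized Q and K."""
    q_i8, q_s = quantize_int8(q)
    k_i8, k_s = quantize_int8(k)
    # Emulate the INT8 GEMM in fp32 (exact here, since the int32 partial sums
    # stay far below 2**24); real kernels would use tensor-core INT8 GEMMs.
    scores = q_i8.float() @ k_i8.float().transpose(-1, -2)
    scores = scores * (q_s * k_s) / q.shape[-1] ** 0.5  # dequantize + softmax scaling
    attn = torch.softmax(scores, dim=-1)
    return (attn @ v.float()).to(v.dtype)

# Toy usage: 2 heads, 128 tokens, head_dim 64.
q, k, v = (torch.randn(2, 128, 64, dtype=torch.float16) for _ in range(3))
out = quantized_attention(q, k, v)
print(out.shape)  # torch.Size([2, 128, 64])
```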