HuyNguyen-hust / hopper-gemm-101
☆11 · Updated 11 months ago
Alternatives and similar repositories for hopper-gemm-101
Users interested in hopper-gemm-101 are comparing it to the libraries listed below.
- KV cache compression via sparse coding ☆16 · Updated last month
- The evaluation framework for training-free sparse attention in LLMs ☆106 · Updated 2 months ago
- ☆132 · Updated 6 months ago
- Transformers components but in Triton ☆34 · Updated 7 months ago
- ☆18 · Updated last year
- Official implementation for "Pruning Large Language Models with Semi-Structural Adaptive Sparse Training" (AAAI 2025) ☆16 · Updated 5 months ago
- An efficient implementation of the NSA (Native Sparse Attention) kernel ☆126 · Updated 5 months ago
- ☆15 · Updated last year
- A bunch of kernels that might make stuff slower 😉 ☆65 · Updated 2 weeks ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆86 · Updated last year
- Official repository for the paper Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regressi… ☆23 · Updated 2 months ago
- ☆52 · Updated 7 months ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆222 · Updated 6 months ago
- 🎬 3.7× faster video generation E2E 🖼️ 1.6× faster image generation E2E ⚡ ColumnSparseAttn 9.3× vs FlashAttn‑3 💨 ColumnSparseGEMM 2.5× … ☆96 · Updated 3 months ago
- ☆259 · Updated 6 months ago
- The official repository of Quamba1 [ICLR 2025] & Quamba2 [ICML 2025] ☆63 · Updated 6 months ago
- HALO: Hadamard-Assisted Low-Precision Optimization and Training method for finetuning LLMs. 🚀 The official implementation of https://arx… ☆29 · Updated 10 months ago
- Fast and memory-efficient exact attention ☆74 · Updated 9 months ago
- AdaSplash: Adaptive Sparse Flash Attention (aka Flash Entmax Attention) ☆30 · Updated 2 months ago
- ☆60 · Updated last year
- ☆113 · Updated last month
- Code for the paper "Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling" ☆78 · Updated this week
- Vortex: A Flexible and Efficient Sparse Attention Framework ☆43 · Updated 2 weeks ago
- Mixed precision training from scratch with Tensors and CUDA ☆28 · Updated last year
- Extensible collectives library in Triton ☆91 · Updated 8 months ago
- QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning ☆147 · Updated last month
- 🔥 LLM-powered GPU kernel synthesis: Train models to convert PyTorch ops into optimized Triton kernels via SFT+RL. Multi-turn compilation… ☆107 · Updated last month
- ☆39 · Updated this week
- ☆125 · Updated 4 months ago
- Distributed MoE in a Single Kernel [NeurIPS '25] ☆155 · Updated this week