AndreSlavescu / mHC.cu
mHC kernels implemented in CUDA
☆120 · Updated this week
Alternatives and similar repositories for mHC.cu
Users interested in mHC.cu are comparing it to the libraries listed below.
- ☆263 · Updated 7 months ago
- A collection of tricks and tools to speed up transformer models ☆194 · Updated 3 weeks ago
- An efficient implementation of the NSA (Native Sparse Attention) kernel ☆128 · Updated 6 months ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆225 · Updated 6 months ago
- The evaluation framework for training-free sparse attention in LLMs ☆108 · Updated 2 months ago
- QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning ☆155 · Updated last month
- ☆133 · Updated 7 months ago
- Autonomous GPU Kernel Generation via Deep Agents ☆202 · Updated this week
- Fast and memory-efficient exact attention ☆75 · Updated 10 months ago
- ☆116 · Updated 7 months ago
- ring-attention experiments ☆161 · Updated last year
- Efficient Long-context Language Model Training by Core Attention Disaggregation ☆73 · Updated last week
- Accelerating MoE with IO and Tile-aware Optimizations ☆522 · Updated this week
- ☆52 · Updated 7 months ago
- PTX tutorial written purely by AIs (OpenAI Deep Research and Claude 3.7) ☆66 · Updated 9 months ago
- QeRL enables RL for 32B LLMs on a single H100 GPU ☆469 · Updated last month
- [ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule ☆415 · Updated 3 months ago
- ☆44 · Updated 9 months ago
- Efficient Triton implementation of Native Sparse Attention ☆257 · Updated 7 months ago
- Ship correct and fast LLM kernels to PyTorch ☆127 · Updated 3 weeks ago
- Accelerate LLM preference tuning via prefix sharing with a single line of code ☆51 · Updated 6 months ago
- DeeperGEMM: crazy optimized version ☆74 · Updated 8 months ago
- Fast and memory-efficient exact kmeans ☆131 · Updated last month
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆131 · Updated last year
- ☆213 · Updated last month
- ☆96 · Updated 9 months ago
- Coding CUDA every day! ☆72 · Updated last month
- A simple replication of vLLM based on Nano-vLLM, with self-contained paged attention and flash attention implementations ☆166 · Updated last week
- ☆125 · Updated 4 months ago
- Distributed MoE in a Single Kernel [NeurIPS '25] ☆174 · Updated this week