ScalingIntelligence / good-kernels
Samples of good AI-generated CUDA kernels
☆86 · Updated 2 months ago
Alternatives and similar repositories for good-kernels
Users interested in good-kernels are comparing it to the repositories listed below.
- RWKV-7: Surpassing GPT ☆94 · Updated 8 months ago
- ☆145 · Updated last month
- ☆75 · Updated last month
- Official implementation for "Training LLMs with MXFP4" ☆55 · Updated 3 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆142 · Updated this week
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆127 · Updated 8 months ago
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆48 · Updated this week
- [ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration ☆225 · Updated 8 months ago
- Simple high-throughput inference library ☆125 · Updated 2 months ago
- High-Performance SGEMM on CUDA devices ☆98 · Updated 6 months ago
- TritonParse: A Compiler Tracer, Visualizer, and mini-Reproducer (WIP) for Triton Kernels ☆138 · Updated this week
- ☆215 · Updated 6 months ago
- LLM Inference on consumer devices ☆123 · Updated 4 months ago
- ☆44 · Updated last month
- Advanced Ultra-Low Bitrate Compression Techniques for the LLaMA Family of LLMs ☆110 · Updated last year
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" ☆244 · Updated 6 months ago
- PTX-Tutorial written purely by AIs (OpenAI Deep Research and Claude 3.7) ☆66 · Updated 4 months ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆149 · Updated last month
- Work in progress. ☆70 · Updated last month
- PB-LLM: Partially Binarized Large Language Models ☆153 · Updated last year
- QuIP quantization ☆55 · Updated last year
- ☆51 · Updated 9 months ago
- Training-free Post-training Efficient Sub-quadratic Complexity Attention. Implemented with OpenAI Triton. ☆142 · Updated this week
- 👷 Build compute kernels ☆87 · Updated last week
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆80 · Updated 11 months ago
- KV cache compression for high-throughput LLM inference ☆134 · Updated 6 months ago
- FlashRNN - Fast RNN Kernels with I/O Awareness ☆93 · Updated last month
- Prepare for DeepSeek R1 inference: Benchmark CPU, DRAM, SSD, iGPU, GPU, ... with efficient code. ☆72 · Updated 6 months ago
- Token Omission Via Attention ☆128 · Updated 9 months ago
- Extensible collectives library in Triton ☆88 · Updated 4 months ago