Samples of good AI-generated CUDA kernels
☆102 · May 30, 2025 · Updated 9 months ago
Alternatives and similar repositories for good-kernels
Users that are interested in good-kernels are comparing it to the libraries listed below.
- LLM as World Models using Bayesian inference ☆16 · May 27, 2025 · Updated 9 months ago
- An LLM-based AI agent that automatically writes correct and efficient GPU kernels. ☆78 · Updated this week
- ☆21 · May 13, 2022 · Updated 3 years ago
- TORCH_TRACE parser for PT2 ☆78 · Updated this week
- TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators ☆122 · Jun 14, 2025 · Updated 9 months ago
- I trained a 130M-parameter Llama architecture, coded from the ground up, to build a small instruct model from scratch. Trained on the FineWeb dataset… ☆17 · Mar 26, 2025 · Updated 11 months ago
- Optimizing diffusion for production-ready speeds ☆39 · Jan 10, 2026 · Updated 2 months ago
- A source-to-source compiler for optimizing CUDA dynamic parallelism by aggregating launches ☆15 · Jun 21, 2019 · Updated 6 years ago
- A collection of GPU experiments and benchmarks for my personal understanding and research. ☆26 · Mar 15, 2026 · Updated last week
- Experiments Notebook of "Understanding the Skill Gap in Recurrent Language Models: The Role of the Gather-and-Aggregate Mechanism" ☆15 · Apr 30, 2025 · Updated 10 months ago
- Inference deployment of Llama 3 ☆10 · Apr 21, 2024 · Updated last year
- Code for our paper "Decomposing The Dark Matter of Sparse Autoencoders" ☆23 · Feb 6, 2025 · Updated last year
- Development containers for triton and triton-cpu ☆24 · Mar 9, 2026 · Updated 2 weeks ago
- ☆21 · Mar 12, 2026 · Updated last week
- An awesome list that curates the best Flet tools, tutorials, blogs and more. ☆10 · Jan 8, 2023 · Updated 3 years ago
- EleutherAI ML Performance reading group repository (slides, meeting recordings, annotated papers) ☆31 · Updated this week
- JAX Scalify: end-to-end scaled arithmetics ☆18 · Oct 30, 2024 · Updated last year
- Model acceleration / model compression (all labs completed) ☆11 · Dec 24, 2023 · Updated 2 years ago
- The official implementation of "EDA-DM: Enhanced Distribution Alignment for Post-Training Quantization of Diffusion Models" ☆21 · Jul 8, 2025 · Updated 8 months ago
- ☆13 · Oct 17, 2024 · Updated last year
- TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels ☆196 · Updated this week
- SDXL GPU cluster scripts ☆16 · Oct 28, 2023 · Updated 2 years ago
- ☆42 · Sep 8, 2023 · Updated 2 years ago
- smolbox of recipes ☆29 · Apr 23, 2025 · Updated 11 months ago
- ☆50 · Jan 28, 2025 · Updated last year
- A torch compile backend for multi-targets ☆47 · Updated this week
- Multichannel Looper/Feedback System for Riffusion ☆14 · May 6, 2023 · Updated 2 years ago
- Collected study materials for ZJU's Introduction to Mao Zedong Thought (毛概) course ☆10 · Mar 16, 2024 · Updated 2 years ago
- Benchmarks for Enzyme on GPUs ☆11 · Feb 22, 2026 · Updated last month
- An insanely secure password manager. ☆17 · Mar 10, 2026 · Updated last week
- Demo project for PistonDevelopers/skeletal_animation ☆22 · Nov 14, 2023 · Updated 2 years ago
- Benchmarking LLMs on Typst ☆19 · May 26, 2025 · Updated 9 months ago
- Training AI for Super Smash Bros. Melee ☆32 · Mar 27, 2025 · Updated 11 months ago
- ⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, and achieve peak⚡️ performance. ☆149 · May 10, 2025 · Updated 10 months ago
- Tile primitives for speedy kernels ☆3,244 · Updated this week
- 🤖 Complete reproduction of 'AlphaGo Moment for Model Architecture Discovery' using MLX-LM instead of GPT-4. Autonomous neural architectu… ☆27 · Jul 27, 2025 · Updated 7 months ago
- HeteroHalide: From Image Processing DSL to Efficient FPGA Acceleration ☆15 · Sep 14, 2020 · Updated 5 years ago
- High-Performance FP32 GEMM on CUDA devices ☆118 · Jan 21, 2025 · Updated last year
- Allows two LLMs to communicate and run code in the terminal ☆28 · Dec 8, 2024 · Updated last year