Step-by-step implementation of a fast softmax kernel in CUDA
☆68 · Jan 6, 2025 · Updated last year
Alternatives and similar repositories for FastSoftmax
Users that are interested in FastSoftmax are comparing it to the libraries listed below.
- ☆96 · May 30, 2026 · Updated 2 weeks ago
- Residual Quantization Autoencoder, used for interpreting LLMs ☆14 · Jan 1, 2025 · Updated last year
- learn TensorRT from scratch🥰 ☆18 · Sep 29, 2024 · Updated last year
- Row-wise block scaling for fp8 quantization matrix multiplication. Solution to GPU mode AMD challenge. ☆19 · Feb 9, 2026 · Updated 4 months ago
- Standalone commandline CLI tool for compiling Triton kernels ☆20 · Sep 13, 2024 · Updated last year
- ☆18 · Mar 12, 2025 · Updated last year
- Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference. ☆47 · Jun 11, 2025 · Updated last year
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard! ☆267 · Updated this week
- Super fast FP32 matrix multiplication on RDNA3