MagellaX / StreamAttn
A high-performance attention mechanism that computes softmax normalization in a single streaming pass using running accumulators (online softmax).
☆28 · Updated 4 months ago
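The description above refers to the online softmax technique. As a rough illustration only (this is not code from the StreamAttn repository, and the function name `streaming_attention` is hypothetical), a single-query attention output can be computed in one pass by keeping a running maximum, a running denominator, and a running weighted sum, rescaling the accumulators whenever a larger score appears. The sketch below assumes plain Python lists and omits the usual `1/sqrt(d)` scaling and any batching or tiling.

```python
import math

def streaming_attention(query, keys, values):
    """Single-query attention in one pass (hypothetical sketch, not the repo's API)."""
    running_max = float("-inf")    # largest score seen so far
    denom = 0.0                    # running sum of exp(score - running_max)
    acc = [0.0] * len(values[0])   # running weighted sum of value vectors
    for key, value in zip(keys, values):
        score = sum(q * k for q, k in zip(query, key))
        new_max = max(running_max, score)
        # Rescale the old accumulators to the new reference maximum.
        scale = math.exp(running_max - new_max)
        weight = math.exp(score - new_max)
        denom = denom * scale + weight
        acc = [a * scale + weight * v for a, v in zip(acc, value)]
        running_max = new_max
    # Normalize once at the end; no second pass over the scores is needed.
    return [a / denom for a in acc]
```

Because every exponent is taken relative to the running maximum, the sketch stays finite even when raw scores are large enough that a naive `exp(score)` would overflow.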
Alternatives and similar repositories for StreamAttn
Users who are interested in StreamAttn are comparing it to the repositories listed below.
- Learning about CUDA by writing PTX code. ☆152 · Updated last year
- Quantized LLM training in pure CUDA/C++. ☆238 · Updated 3 weeks ago
- SIMD quantization kernels ☆94 · Updated 5 months ago
- PTX-Tutorial Written Purely By AIs (Deep Research of OpenAI and Claude 3.7) ☆66 · Updated 10 months ago
- PyTorch from scratch in pure C/CUDA and Python ☆41 · Updated last year
- PCCL (Prime Collective Communications Library) implements fault tolerant collective communications over IP ☆141 · Updated 5 months ago
- ☆90 · Updated last month
- CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning ☆417 · Updated last month
- Tensor library with autograd using only Rust's standard library ☆71 · Updated last year
- ☆89 · Updated 3 months ago
- Experimental GPU language with meta-programming ☆25 · Updated last year
- Rust Implementation of micrograd ☆53 · Updated last year
- ☆73 · Updated last week
- In this repository, I'm going to implement increasingly complex LLM inference optimizations ☆81 · Updated 8 months ago
- NSA Triton Kernels written with GPT5 and Opus 4.1 ☆70 · Updated 6 months ago
- moondream in zig. ☆75 · Updated 8 months ago
- an open source reproduction of NVIDIA's nGPT (Normalized Transformer with Representation Learning on the Hypersphere) ☆110 · Updated 11 months ago
- Lego for GRPO ☆30 · Updated 8 months ago
- NanoGPT-speedrunning for the poor T4 enjoyers ☆73 · Updated 9 months ago
- High-Performance FP32 GEMM on CUDA devices ☆117 · Updated last year
- look how they massacred my boy ☆63 · Updated last year
- My submission for the GPUMODE/AMD fp8 mm challenge ☆29 · Updated 8 months ago
- Samples of good AI generated CUDA kernels ☆99 · Updated 8 months ago
- Small scale distributed training of sequential deep learning models, built on NumPy and MPI. ☆155 · Updated 2 years ago
- A really tiny autograd engine ☆99 · Updated 8 months ago
- tiny code to access tenstorrent blackhole ☆61 · Updated 8 months ago
- PyTorch memory allocation visualizer ☆67 · Updated 6 months ago
- Simple high-throughput inference library ☆155 · Updated 8 months ago
- 👷 Build compute kernels ☆215 · Updated 2 weeks ago
- ☆466 · Updated 2 months ago