MagellaX / StreamAttn

A high-performance attention mechanism that computes softmax normalization in a single streaming pass using running accumulators (online softmax).
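The single-pass idea in that description can be sketched in plain Python. This is a minimal illustration of online softmax for one query vector, not StreamAttn's actual kernel; the function names `streaming_attention` and `reference_attention`, the `1/sqrt(d)` scaling, and the list-based data layout are all assumptions made for the example.

```python
import math

def streaming_attention(q, keys, values):
    """Attention output for one query, visiting each (key, value) pair once.

    Hypothetical helper for illustration; not taken from StreamAttn.
    """
    m = -math.inf                   # running maximum of the scores seen so far
    denom = 0.0                     # running softmax denominator, relative to m
    acc = [0.0] * len(values[0])    # running weighted sum of values, relative to m
    scale = 1.0 / math.sqrt(len(q)) # assumed scaled dot-product scoring
    for k, v in zip(keys, values):
        s = scale * sum(qi * ki for qi, ki in zip(q, k))
        m_new = max(m, s)
        # Rescale the old accumulators to the new maximum.
        # On the first step m is -inf, so the correction is exp(-inf) = 0.
        correction = math.exp(m - m_new)
        w = math.exp(s - m_new)
        denom = denom * correction + w
        acc = [a * correction + w * vi for a, vi in zip(acc, v)]
        m = m_new
    return [a / denom for a in acc]

def reference_attention(q, keys, values):
    """Conventional version: materialize all scores, then normalize."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    mx = max(scores)
    weights = [math.exp(s - mx) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]
```

Both functions should agree up to floating-point rounding. The streaming version never stores the full score vector, which is what lets a GPU kernel process keys and values block by block.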

☆27

Alternatives and similar repositories for StreamAttn

Users who are interested in StreamAttn are comparing it to the repositories listed below.

unixpickle / learn-ptx
Learning about CUDA by writing PTX code.
☆147 Updated last year
IST-DASLab / llmq
Quantized LLM training in pure CUDA/C++.
☆218 Updated this week
PrimeIntellect-ai / pccl
PCCL (Prime Collective Communications Library) implements fault tolerant collective communications over IP
☆138 Updated 2 months ago
cloneofsimo / ptx-tutorial-by-aislop
PTX-Tutorial Written Purely By AIs (Deep Research of Openai and Claude 3.7)
☆66 Updated 8 months ago
PrimeIntellect-ai / pi-quant
SIMD quantization kernels
☆92 Updated 2 months ago
nreHieW / r-nn
Tensor library with autograd using only Rust's standard library
☆70 Updated last year
apoorvnandan / lilgrad
pytorch from scratch in pure C/CUDA and python
☆41 Updated last year
naklecha / llm-inference-optimizations-explained
in this repository, i'm going to implement increasingly complex llm inference optimizations
☆70 Updated 6 months ago
Laz4rz / ziglings
Extensive introductory writeup on Zig language functionalities
☆10 Updated last year
siboehm / ShallowSpeed
Small scale distributed training of sequential deep learning models, built on Numpy and MPI.
☆151 Updated 2 years ago
xjdr-alt / simple_transformer
Simple Transformer in Jax
☆139 Updated last year
charlesfrye / cuda-substrings
Because it's there.
☆16 Updated last year
JoeLi12345 / nGPT
an open source reproduction of NVIDIA's nGPT (Normalized Transformer with Representation Learning on the Hypersphere)
☆108 Updated 8 months ago
Snektron / gpumode-amd-fp8-mm
My submission for the GPUMODE/AMD fp8 mm challenge
☆29 Updated 5 months ago
PrimeIntellect-ai / protocol
peer-to-peer compute and intelligence network that enables decentralized AI development at scale
☆132 Updated 2 weeks ago
spikedoanz / from-bits-to-intelligence
could we make an ml stack in 100,000 lines of code?
☆46 Updated last year
linjames0 / Transformer-CUDA
An implementation of the transformer architecture onto an Nvidia CUDA kernel
☆195 Updated 2 years ago
VatsaDev / NanoPoor
NanoGPT-speedrunning for the poor T4 enjoyers
☆73 Updated 7 months ago
SzymonOzog / GPU_Programming
☆85 Updated 2 weeks ago
kuterd / opal_ptx
Experimental GPU language with meta-programming
☆24 Updated last year
huggingface / kernel-builder
👷 Build compute kernels
☆186 Updated this week
joey00072 / Tinytorch
A really tiny autograd engine
☆96 Updated 6 months ago
SzymonOzog / Penny
Hand-Rolled GPU communications library
☆70 Updated last week
leloykun / modded-nanogpt
NanoGPT (124M) quality in 2.67B tokens
☆28 Updated 2 months ago
ScalingIntelligence / good-kernels
Samples of good AI generated CUDA kernels
☆92 Updated 6 months ago
kanpuriyanawab / picograd
Rust Implementation of micrograd
☆53 Updated last year
salykova / sgemm.cu
High-Performance SGEMM on CUDA devices
☆112 Updated 10 months ago
xjdr-alt / llmri
look how they massacred my boy
☆63 Updated last year
kubernetes-bad / reward-composer
Lego for GRPO
☆30 Updated 6 months ago
LaurieWired / BenchmarkCustomPTX
Custom PTX Instruction Benchmark
☆134 Updated 9 months ago