Flash Attention from Scratch on CUDA Ampere
☆146Sep 1, 2025Updated 6 months ago
Alternatives and similar repositories for flash_attention_from_scratch
Users who are interested in flash_attention_from_scratch are comparing it to the libraries listed below.
- An experimental communicating attention kernel based on DeepEP.☆35Jul 29, 2025Updated 7 months ago
- This is a project created and completed by team BOOM (Beihang OO masters). This is a superscalar processor with a 13-stage out-of-order dua…☆17Sep 29, 2024Updated last year
- Low overhead tracing library and trace visualizer for pipelined CUDA kernels☆131Nov 26, 2025Updated 3 months ago
- ☆32Jul 28, 2025Updated 7 months ago
- SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs☆60Mar 25, 2025Updated 11 months ago
- A Heterogeneous GPU Platform for Chipyard SoC☆44Updated this week
- Open ABI and FFI for Machine Learning Systems☆355Updated this week
- ☆65Apr 26, 2025Updated 10 months ago
- Perplexity GPU Kernels☆567Nov 7, 2025Updated 4 months ago
- Multi-GPU dynamic scheduler using PGAS style cross-GPU communication☆29Jul 23, 2023Updated 2 years ago
- NVIDIA cuTile learn☆165Dec 9, 2025Updated 2 months ago
- 📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉☆9,815Feb 25, 2026Updated last week
- A collection of specialized agent skills for AI infrastructure development, enabling Claude Code to write, optimize, and debug high-perfo…☆88Feb 2, 2026Updated last month
- Personal notes from reading and studying the Hummingbird (蜂鸟) E203 source code☆13Aug 1, 2023Updated 2 years ago
- Luthier, a GPU binary instrumentation tool for AMD GPUs☆27Updated this week
- Linux-capable in-order superscalar LoongArch32r processor. Silicon-proven.☆45Jul 25, 2024Updated last year
- This repository contains a SystemVerilog implementation of a parametrized Round Robin arbiter with three instantiation options☆13Jan 28, 2024Updated 2 years ago
- CS6868: Concurrent Programming☆32Feb 26, 2026Updated last week
- Optimize with SigOpt with this standalone SigOpt client driver.☆12Updated this week
- A parser for PTX 6.5☆13Jun 19, 2023Updated 2 years ago
- Material for gpu-mode lectures☆5,800Feb 1, 2026Updated last month
- Puzzles for learning Triton, play it with minimal environment configuration!☆634Dec 28, 2025Updated 2 months ago
- How to optimize some algorithms in CUDA.☆2,841Feb 28, 2026Updated last week
- Benchmark code for the "Online normalizer calculation for softmax" paper☆108Jul 27, 2018Updated 7 years ago
- ☆123Updated this week
- Mars with BUAA CO extension by Toby Shi☆40Nov 20, 2024Updated last year
- A lightweight design for computation-communication overlap.☆223Jan 20, 2026Updated last month
- Accepted to MLSys 2026☆70Updated this week
- ☆11Dec 23, 2025Updated 2 months ago
- Notes for the book Fluent Python, 1st Edition (O'Reilly, 2015)☆11Jun 30, 2022Updated 3 years ago
- RISC-V vector and tensor compute extensions for Vortex GPGPU acceleration for ML workloads. Optimized for transformer models, CNNs, and g…☆21Apr 25, 2025Updated 10 months ago
- A framework to make C memory safe☆13Sep 20, 2022Updated 3 years ago
- A cross-modal vector index with fast construction on heterogeneous CPU-GPU environment. Published on DaMoN@SIGMOD 2025.☆16Jul 16, 2025Updated 7 months ago
- Automated bottleneck detection and solution orchestration☆19Feb 24, 2026Updated last week
- ☆11Jun 9, 2023Updated 2 years ago
- Generate Linux Perf event tables for Apple Silicon☆17Dec 16, 2025Updated 2 months ago
- A docker image for One Student One Chip's debug exam☆10Sep 22, 2023Updated 2 years ago
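- An open-source mathematical large-model project that aims to explore whether large models have mathematical creativity, and what potential they have in frontier mathematical research.☆17May 16, 2025Updated 9 months ago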
- ☆14Oct 30, 2024Updated last year