cassiewilliam/cuda_op_benchmark

An easily extensible framework for understanding and optimizing CUDA operators, intended for learning purposes only.

☆18

Alternatives and similar repositories for cuda_op_benchmark

Users that are interested in cuda_op_benchmark are comparing it to the libraries listed below.

TiledTensor / TiledBench
Benchmark tests supporting the TiledCUDA library.
☆19 · Nov 19, 2024 · Updated last year
Hyaloid / AccSpMM
Official implementation of Acc-SpMM: Accelerating General-purpose Sparse Matrix-Matrix Multiplication with GPU Tensor Cores.
☆15 · Nov 13, 2025 · Updated 7 months ago
HazyResearch / embroid
Embroid: Unsupervised Prediction Smoothing Can Improve Few-Shot Classification
☆11 · Aug 12, 2023 · Updated 2 years ago
GindaChen / nsys-ai
Terminal UI for NVIDIA Nsight Systems profiles — timeline viewer, kernel navigator, NVTX hierarchy
☆58 · Updated this week
Noahs-ARK / PaLM
PyTorch implementation for PaLM: A Hybrid Parser and Language Model.
☆10 · Jan 7, 2020 · Updated 6 years ago
Bruce-Lee-LY / cutlass_gemm
Multiple GEMM operators are constructed with cutlass to support LLM inference.
☆20 · Aug 3, 2025 · Updated 10 months ago
eunomia-bpf / bpf-benchmark
Userspace eBPF Runtime Benchmarking Test Suite and Results
☆17 · Updated this week
wzsh / wmma_tensorcore_sample
Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)
☆147 · Aug 18, 2020 · Updated 5 years ago
YdrMaster / cuda-driver
A CUDA runtime environment based on the CUDA Driver API
☆16 · Jul 30, 2025 · Updated 10 months ago
jiegec / chisel-memory-lower
Lower chisel memories to SRAM macros
☆13 · Mar 25, 2024 · Updated 2 years ago
alexshuang / fleet-compiler
An MLIR-based AI compiler designed for Python frontend to RISC-V DSA
☆15 · Oct 10, 2024 · Updated last year
stemnic / rustyvisor
Hypervisor written in Rust for the RISC-V 1.0 hypervisor extension
☆16 · Oct 21, 2024 · Updated last year
IDSIA / rtrl-elstm
Official repository for the paper "Exploring the Promise and Limits of Real-Time Recurrent Learning" (ICLR 2024)
☆13 · Jun 11, 2025 · Updated last year
faasm / faasmjs
Serverless browser offloading with Faasm and WebAssembly
☆16 · Feb 14, 2022 · Updated 4 years ago
srush / tangent
Source-to-Source Debuggable Derivatives in Pure Python
☆15 · Jan 23, 2024 · Updated 2 years ago
Benjamin-Walker / selective-ssms-and-linear-cdes
Code for "Theoretical Foundations of Deep Selective State-Space Models" (NeurIPS 2024)
☆16 · Jan 7, 2025 · Updated last year
rycolab / aflt-f2023
Advanced Formal Language Theory (263-5352-00L; Spring 2023)
☆10 · Feb 21, 2023 · Updated 3 years ago
siyuanseever / llama2Rnn.c
☆13 · Apr 15, 2024 · Updated 2 years ago
yikangshen / megablocks
☆20 · May 30, 2024 · Updated 2 years ago
Scientific-Computing-Lab / STREAMer
STREAMer: Benchmarking remote volatile and non-volatile memory bandwidth
☆18 · Aug 21, 2023 · Updated 2 years ago
Chtholly-Boss / swizzle
A practical way of learning Swizzle
☆41 · Feb 3, 2025 · Updated last year
proger / nanokitchen
Parallel Associative Scan for Language Models
☆18 · Jan 8, 2024 · Updated 2 years ago
srush / mamba-scans
Blog post
☆17 · Feb 16, 2024 · Updated 2 years ago
rcore-os / rCore-Tutorial-v3-arm64
Let's write an OS which can run on ARM in Rust from scratch! (🚧WIP)
☆18 · Mar 13, 2022 · Updated 4 years ago
hazan-lab / flash-stu
PyTorch implementation of the Flash Spectral Transform Unit.
☆22 · Sep 19, 2024 · Updated last year
Oyami-Srk / RISCV-GDB-Paging
Paging debug tool for GDB using Python
☆13 · Jun 4, 2022 · Updated 4 years ago
radarFudan / Curse-of-memory
Curse-of-memory phenomenon of RNNs in sequence modelling
☆19 · May 8, 2025 · Updated last year
acosharma / elita-transformer
Official Repository for Efficient Linear-Time Attention Transformers.
☆18 · Jun 2, 2024 · Updated 2 years ago
luliyucoordinate / cute-flash-attention
Implement Flash Attention using Cute.
☆108 · Dec 17, 2024 · Updated last year
maximzubkov / fft-scan
Efficient PScan implementation in PyTorch
☆17 · Jan 2, 2024 · Updated 2 years ago
HPMLL / SpInfer_EuroSys25
☆34 · Apr 2, 2025 · Updated last year
KnowingNothing / MatmulTutorial
An Easy-to-understand TensorOp Matmul Tutorial
☆440 · Mar 5, 2026 · Updated 3 months ago
subho406 / agalite
AGaLiTe: Approximate Gated Linear Transformers for Online Reinforcement Learning (Published in TMLR)
☆23 · Oct 15, 2024 · Updated last year
juvi21 / CoPE-cuda
Contextual Position Encoding but with some custom CUDA Kernels https://arxiv.org/abs/2405.18719
☆22 · Jun 5, 2024 · Updated 2 years ago
NVIDIA / HMM_sample_code
CUDA 12.2 HMM demos
☆21 · Jul 26, 2024 · Updated last year
KurochkinAlexey / AntisymmetricRNN
Python implementation of paper "AntisymmetricRNN: A Dynamical System View on Recurrent Neural Networks"
☆15 · Aug 2, 2019 · Updated 6 years ago
ParCIS / FlashSparse
FlashSparse significantly reduces the computation redundancy for unstructured sparsity (for SpMM and SDDMM) on Tensor Cores through a Swa…
☆39 · Oct 5, 2025 · Updated 8 months ago
mit-han-lab / kernel-design-agents
☆588 · Jun 2, 2026 · Updated 2 weeks ago
hbchen121 / SimpleCNN_Release
Pure C/C++ CNN implementation with CUDA acceleration.
☆21 · Apr 30, 2021 · Updated 5 years ago