simveit/effective_transpose

Effective transpose on Hopper GPU

☆29

Alternatives and similar repositories for effective_transpose

Users who are interested in effective_transpose are comparing it to the libraries listed below.

zhuzilin / vllm-group
☆12 · Nov 5, 2024 · Updated last year
Faraz9877 / H100_GEMM
High-performance GEMM implementation optimized for NVIDIA H100 GPUs, leveraging Hopper architecture's TMA, WGMMA, and Thread Block Cluste…
☆11 · Dec 4, 2024 · Updated last year
gau-nernst / gpu-mode-kernels
https://github.com/gpu-mode/reference-kernels
☆26 · Jul 4, 2026 · Updated 2 weeks ago
HanGuo97 / hilt
☆40 · Dec 14, 2025 · Updated 7 months ago
luliyucoordinate / cute-flash-attention
Implement Flash Attention using Cute.
☆108 · Dec 17, 2024 · Updated last year
yifuwang / symm-mem-recipes
☆170 · Dec 27, 2024 · Updated last year
leimao / CUTLASS-Examples
CUTLASS and CuTe Examples
☆136 · Nov 30, 2025 · Updated 7 months ago
HuyNguyen-hust / hopper-gemm-101
☆13 · Dec 22, 2024 · Updated last year
FlorianRhiem / VFRendering
A vector field rendering library
☆17 · Jul 31, 2019 · Updated 6 years ago
chips-compilers-mlsys-21 / chips-compilers-mlsys-21.github.io
☆11 · Apr 5, 2021 · Updated 5 years ago
mauricioschneider / CS169.1x
CS169.1x Software as a Service course offered by UC Berkeley at edx.org
☆14 · Oct 28, 2014 · Updated 11 years ago
babelouest / orcania
Potluck with different functions for different purposes that can be shared among C programs
☆14 · May 9, 2026 · Updated 2 months ago
babelouest / angharad
Personal house automation system with a REST/Json interface
☆18 · Feb 20, 2024 · Updated 2 years ago
ademeure / cuda-side-boost
☆60 · Feb 24, 2026 · Updated 4 months ago
babelouest / yder
Logging library for C applications
☆23 · Apr 26, 2026 · Updated 2 months ago
stockeh / mlx-grokking
Grokking on modular arithmetic in less than 150 epochs in MLX
☆15 · Oct 24, 2024 · Updated last year
little-squirrel-cute / most
Codes for MO's Trading
☆16 · Mar 20, 2022 · Updated 4 years ago
ShaYeBuHui01 / flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆15 · Aug 31, 2023 · Updated 2 years ago
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆86 · May 5, 2025 · Updated last year
meta-pytorch / MSLK
MSLK (Meta Superintelligence Labs Kernels) is a collection of PyTorch GPU operator libraries that are designed and optimized for GenAI tr…
☆121 · Updated this week
enp1s0 / cuMpSGEMM
Fast SGEMM emulation on Tensor Cores
☆17 · Feb 16, 2025 · Updated last year
toyaix / triton-ocl
Triton for OpenCL backend, and use mlir-translate to get source OpenCL code
☆27 · Aug 27, 2025 · Updated 10 months ago
leloykun / flash-attention-minimal
Flash Attention in 300-500 lines of CUDA/C++
☆39 · Aug 22, 2025 · Updated 10 months ago
deciding / cutez
CuTeDSL tutorials, tools, autotuner, profiler, etc.
☆40 · Jun 27, 2026 · Updated 3 weeks ago
erfanzar / jax-flash-attn2
A flexible and efficient implementation of Flash Attention 2.0 for JAX, supporting multiple backends (GPU/TPU/CPU) and platforms (Triton/…
☆34 · Mar 4, 2025 · Updated last year
lucidrains / SRT-H
Implementation of the model architecture for SRT-H
☆30 · Jun 20, 2026 · Updated last month
rutura / ThreadingIPCCode
Source code for LearnQtGuide's Threading and IPC with Qt C++ Course
☆17 · Nov 11, 2019 · Updated 6 years ago
kyegomez / SelfExtend
Implementation of SelfExtend from the paper "LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning" from Pytorch and Zeta
☆13 · Nov 11, 2024 · Updated last year
fzyzcjy / torch_memory_saver
Allow torch tensor memory to be released and resumed later
☆259 · Updated this week
pranjalssh / fast.cu
Fastest kernels written from scratch
☆583 · Sep 18, 2025 · Updated 10 months ago
3rdparty / stout-borrowed-ptr
C++ "borrowing" smart pointer.
☆10 · May 13, 2022 · Updated 4 years ago
NVIDIA / HMM_sample_code
CUDA 12.2 HMM demos
☆21 · Jul 26, 2024 · Updated last year
RGivisiez / Heisenberg-SSE
Stochastic Series Expansion (SSE) for an isotropic S=1/2 antiferromagnetic quantum Heisenberg model in 1D, 2D or 3D lattice. Every lattic…
☆15 · Jan 23, 2021 · Updated 5 years ago
eisneim / photong
Self-hosted responsive photo/album manager & server written in nodejs, koa2, react, redux
☆11 · May 25, 2017 · Updated 9 years ago
m13v / Your_Devin
Fine-tune copilot based on your codebase
☆12 · Mar 26, 2024 · Updated 2 years ago
goombalab / Gather-and-Aggregate
Experiments Notebook of "Understanding the Skill Gap in Recurrent Language Models: The Role of the Gather-and-Aggregate Mechanism"
☆16 · Apr 30, 2025 · Updated last year
sunnnybala / gpt-oss
gpt-oss-120b and gpt-oss-20b are two open-weight language models by OpenAI
☆17 · Jan 12, 2026 · Updated 6 months ago
kyegomez / MobileVLM
Implementation of the LDP module block in PyTorch and Zeta from the paper: "MobileVLM: A Fast, Strong and Open Vision Language Assistant …
☆15 · Mar 11, 2024 · Updated 2 years ago
rutura / Qt6QMLBeginnersCode
Source code for the Qt6 QML For Beginners book
☆21 · Sep 8, 2025 · Updated 10 months ago