simveit / effective_transpose
Effective transpose on Hopper GPU
☆27 Updated 5 months ago
Alternatives and similar repositories for effective_transpose
Users who are interested in effective_transpose compare it to the libraries listed below.
- extensible collectives library in triton ☆95 Updated 10 months ago
- ☆104 Updated last year
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆106 Updated 7 months ago
- JaxPP is a library for JAX that enables flexible MPMD pipeline parallelism for large-scale LLM training ☆64 Updated 2 weeks ago
- A bunch of kernels that might make stuff slower 😉 ☆75 Updated last week
- Triton-based Symmetric Memory operators and examples ☆81 Updated 3 weeks ago
- Ship correct and fast LLM kernels to PyTorch ☆140 Updated 3 weeks ago
- Framework to reduce autotune overhead to zero for well known deployments.