bertmaher / tf32_gemm
Example of binding a TF32 CUTLASS GEMM kernel to PyTorch
☆12 Updated last year
Alternatives and similar repositories for tf32_gemm
Users who are interested in tf32_gemm are comparing it to the libraries listed below.
- ☆121 Updated 9 months ago
- Extensible collectives library in Triton ☆88 Updated 6 months ago
- A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate. ☆328 Updated this week
- GitHub mirror of the triton-lang/triton repo. ☆78 Updated this week
- ☆177 Updated last year
- ☆238 Updated last year
- ☆83 Updated 2 years ago
- Chimera: bidirectional pipeline parallelism for efficiently training large-scale models. ☆66 Updated 6 months ago
- A resilient distributed training framework ☆95 Updated last year
- nnScaler: Compiling DNN models for Parallel Training ☆118 Updated last week
- ☆75 Updated 4 years ago
- Distributed MoE in a Single Kernel [NeurIPS '25] ☆49 Updated this week
- MSCCL++: A GPU-driven communication stack for scalable AI applications ☆418 Updated this week
- Microsoft Collective Communication Library ☆360 Updated 2 years ago
- Applied AI experiments and examples for PyTorch ☆296 Updated last month
- Synthesizer for optimal collective communication algorithms ☆117 Updated last year
- Microsoft Collective Communication Library ☆66 Updated 10 months ago
- A lightweight design for computation-communication overlap. ☆177 Updated 2 weeks ago
- ☆81 Updated 4 months ago
- ☆90 Updated 10 months ago
- A library to analyze PyTorch traces. ☆414 Updated last week
- Fastest kernels written from scratch ☆366 Updated 2 weeks ago
- ☆242 Updated this week
- A schedule language for large model training ☆151 Updated last month
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆230 Updated this week
- AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving (OSDI 23) ☆87 Updated 2 years ago
- A ChatGPT (GPT-3.5) & GPT-4 Workload Trace to Optimize LLM Serving Systems ☆205 Updated 2 months ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆123 Updated 4 months ago
- A baseline repository of Auto-Parallelism in Training Neural Networks ☆146 Updated 3 years ago
- ☆144 Updated 4 months ago