NVIDIA / jaxpp

JaxPP is a library for JAX that enables flexible MPMD pipeline parallelism for large-scale LLM training

☆52

Alternatives and similar repositories for jaxpp

Users who are interested in jaxpp are comparing it to the libraries listed below.

pytorch / helion
A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.
☆212 · Updated this week
cchan / tccl
Extensible collectives library in Triton
☆88 · Updated 4 months ago
pytorch-labs / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆199 · Updated this week
Deep-Learning-Profiling-Tools / triton-viz
☆227 · Updated this week
pytorch-labs / applied-ai
Applied AI experiments and examples for PyTorch
☆289 · Updated 2 months ago
meta-pytorch / tritonparse
TritonParse: A Compiler Tracer, Visualizer, and mini-Reproducer (WIP) for Triton Kernels
☆139 · Updated this week
Dao-AILab / quack
A Quirky Assortment of CuTe Kernels
☆388 · Updated this week
zinccat / Awesome-Triton-Kernels
Collection of kernels written in the Triton language
☆142 · Updated 4 months ago
triton-lang / kernels
☆85 · Updated 9 months ago
gpu-mode / ring-attention
Ring-attention experiments
☆146 · Updated 9 months ago
mobiusml / gemlite
Fast low-bit matmul kernels in Triton
☆339 · Updated this week
gpu-mode / triton-index
Cataloging released Triton kernels.
☆247 · Updated 6 months ago
foundation-model-stack / foundation-model-stack
🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components.
☆207 · Updated this week
pytorch-labs / float8_experimental
This repository contains the experimental PyTorch-native float8 training UX.
☆224 · Updated last year
open-lm-engine / flash-model-architectures
A bunch of kernels that might make stuff slower 😉
☆56 · Updated last week
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆107 · Updated 2 months ago
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆93 · Updated last month
openxla / shardy
MLIR-based partitioning system
☆115 · Updated this week
pranjalssh / fast.cu
Fastest kernels written from scratch
☆310 · Updated 4 months ago
facebookresearch / HolisticTraceAnalysis
A library to analyze PyTorch traces.
☆402 · Updated this week
stanford-futuredata / stk
☆107 · Updated 11 months ago
albanD / subclass_zoo
☆171 · Updated last year
ColfaxResearch / cutlass-kernels
☆228 · Updated last year
google / aqt
☆323 · Updated last week
ROCm / aotriton
Ahead of Time (AOT) Triton Math Library
☆75 · Updated this week
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆71 · Updated 3 months ago
IBM / triton-dejavu
Framework to reduce autotune overhead to zero for well-known deployments.
☆79 · Updated 2 weeks ago
yifuwang / symm-mem-recipes
☆102 · Updated 7 months ago
bertmaher / simplegemm
☆110 · Updated 4 months ago
wangsiping97 / FastGEMV
High-speed GEMV kernels, with up to 2.7x speedup compared to the PyTorch baseline.
☆113 · Updated last year