IaroslavElistratov / triton-autodiff
☆14 · Updated 3 weeks ago
Alternatives and similar repositories for triton-autodiff
Users interested in triton-autodiff are comparing it to the libraries listed below.
- PCCL (Prime Collective Communications Library) implements fault-tolerant collective communications over IP ☆133 · Updated last month
- How to ensure correctness and ship LLM-generated kernels in PyTorch ☆107 · Updated last week
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆58 · Updated 2 weeks ago
- PTX tutorial written purely by AIs (OpenAI Deep Research and Claude 3.7) ☆66 · Updated 7 months ago
- ☆28 · Updated 9 months ago
- High-Performance SGEMM on CUDA devices ☆107 · Updated 9 months ago
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard! ☆98 · Updated last week
- Small-scale distributed training of sequential deep learning models, built on NumPy and MPI. ☆146 · Updated 2 years ago
- SIMD quantization kernels ☆89 · Updated last month
- Experiment using Tangent to autodiff Triton ☆80 · Updated last year
- Custom kernels in the Triton language for accelerating LLMs ☆26 · Updated last year
- train with kittens! ☆63 · Updated last year
- ☆89 · Updated last year
- ring-attention experiments ☆155 · Updated last year
- A bunch of kernels that might make stuff slower 😉 ☆63 · Updated this week
- Automatic differentiation for Triton kernels (a minimal sketch of the idea follows this list) ☆11 · Updated 2 months ago
- 👷 Build compute kernels ☆163 · Updated this week
- Quantized LLM training in pure CUDA/C++. ☆209 · Updated this week
- Make Triton easier ☆48 · Updated last year
- TORCH_LOGS parser for PT2 ☆62 · Updated last month
- TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels ☆164 · Updated this week
- Extensible collectives library in Triton ☆90 · Updated 6 months ago
- Learn CUDA with PyTorch ☆95 · Updated last month
- An implementation of the transformer architecture as an NVIDIA CUDA kernel ☆191 · Updated 2 years ago
- Collection of kernels written in the Triton language ☆159 · Updated 6 months ago
- Evaluating Large Language Models for CUDA Code Generation. ComputeEval is a framework designed to generate and evaluate CUDA code from Lar… ☆69 · Updated 3 weeks ago
- JaxPP is a library for JAX that enables flexible MPMD pipeline parallelism for large-scale LLM training ☆55 · Updated 2 weeks ago
- Triton-based Symmetric Memory operators and examples ☆48 · Updated last week
- LLM training in simple, raw C/CUDA ☆107 · Updated last year
- A stand-alone implementation of several NumPy dtype extensions used in machine learning. ☆305 · Updated this week
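
Several of the entries above, such as the Tangent experiment and the automatic-differentiation project, revolve around differentiating Triton kernels. As context, here is a minimal, hypothetical sketch (not taken from any repository listed here) of a forward Triton kernel together with the hand-written backward kernel that such autodiff tools aim to derive automatically. It assumes `triton` and `torch` are installed and a CUDA device is available.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def square_kernel(x_ptr, out_ptr, n, BLOCK: tl.constexpr):
    # Forward: out[i] = x[i] ** 2, one block of BLOCK elements per program.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x * x, mask=mask)


@triton.jit
def square_grad_kernel(x_ptr, dy_ptr, dx_ptr, n, BLOCK: tl.constexpr):
    # Backward: d/dx (x^2) = 2x, so dx = 2 * x * dy. This is the rule an
    # autodiff tool would derive from the forward kernel above.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    dy = tl.load(dy_ptr + offs, mask=mask)
    tl.store(dx_ptr + offs, 2.0 * x * dy, mask=mask)


x = torch.randn(1024, device="cuda")
y = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 256),)
square_kernel[grid](x, y, x.numel(), BLOCK=256)

# Seed the backward pass with dy = 1 and check against the analytic gradient.
dy = torch.ones_like(x)
dx = torch.empty_like(x)
square_grad_kernel[grid](x, dy, dx, x.numel(), BLOCK=256)
assert torch.allclose(dx, 2 * x)
```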