axonn-ai / axonn
A parallel framework for training deep neural networks
☆63 · Updated 4 months ago
Alternatives and similar repositories for axonn
Users interested in axonn are comparing it to the libraries listed below.
- Extensible collectives library in Triton ☆88 · Updated 4 months ago
- A bunch of kernels that might make stuff slower 😉 ☆56 · Updated last week
- This repository contains the experimental PyTorch native float8 training UX ☆224 · Updated last year
- ring-attention experiments ☆146 · Updated 9 months ago
- ☆107 · Updated 11 months ago
- ☆85 · Updated 9 months ago
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆48 · Updated this week
- Collection of kernels written in the Triton language ☆142 · Updated 4 months ago
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components. ☆207 · Updated this week
- Applied AI experiments and examples for PyTorch ☆289 · Updated 2 months ago
- ☆227 · Updated this week
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆152 · Updated last month
- A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate. ☆212 · Updated this week
- Cataloging released Triton kernels. ☆247 · Updated 6 months ago
- Intel Gaudi's Megatron DeepSpeed for training large language models ☆13 · Updated 7 months ago
- ☆74 · Updated 4 months ago
- An experiment using Tangent to autodiff Triton ☆80 · Updated last year
- Effective transpose on Hopper GPUs ☆23 · Updated 3 months ago
- Hydragen: High-Throughput LLM Inference with Shared Prefixes ☆41 · Updated last year
- ☆28 · Updated 6 months ago
- ☆14 · Updated 3 weeks ago
- Learn CUDA with PyTorch ☆33 · Updated 3 weeks ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆107 · Updated 2 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆199 · Updated this week
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆80 · Updated 11 months ago
- Triton-based Symmetric Memory operators and examples ☆23 · Updated last week
- Framework to reduce autotune overhead to zero for well-known deployments. ☆79 · Updated 2 weeks ago
- A Python library that transfers PyTorch tensors between CPU and NVMe ☆118 · Updated 8 months ago
- Triton-based implementation of Sparse Mixture of Experts. ☆230 · Updated 8 months ago
- Fast low-bit matmul kernels in Triton ☆339 · Updated this week