gpu-mode / discord-cluster-manager
Write a fast kernel and run it on Discord. See how you compare against the best!
☆44 · Updated this week
Alternatives and similar repositories for discord-cluster-manager:
Users interested in discord-cluster-manager are comparing it to the libraries listed below.
- Experiment of using Tangent to autodiff triton ☆78 · Updated last year
- extensible collectives library in triton ☆85 · Updated last month
- ring-attention experiments ☆132 · Updated 6 months ago
- PTX-Tutorial Written Purely By AIs (Deep Research by OpenAI and Claude 3.7) ☆66 · Updated last month
- Collection of kernels written in Triton language ☆121 · Updated last month
- A bunch of kernels that might make stuff slower 😉 ☆40 · Updated this week
- Load compute kernels from the Hub ☆115 · Updated 2 weeks ago
- ☆88 · Updated last year
- Small-scale distributed training of sequential deep learning models, built on NumPy and MPI. ☆130 · Updated last year
- ☆202 · Updated last week
- Cataloging released Triton kernels. ☆220 · Updated 3 months ago
- Make triton easier ☆47 · Updated 10 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆122 · Updated this week
- ☆13 · Updated last month
- Applied AI experiments and examples for PyTorch ☆264 · Updated last week
- prime-rl is a codebase for decentralized RL training at scale ☆85 · Updated this week
- PyTorch/XLA integration with JetStream (https://github.com/google/JetStream) for LLM inference ☆60 · Updated last month
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆126 · Updated 5 months ago
- PyTorch centric eager mode debugger ☆47 · Updated 4 months ago
- ☆21 · Updated 2 months ago
- Learn CUDA with PyTorch ☆20 · Updated 3 months ago
- High-Performance SGEMM on CUDA devices ☆90 · Updated 3 months ago
- ☆78 · Updated 6 months ago
- Fast low-bit matmul kernels in Triton ☆295 · Updated this week
- making the official triton tutorials actually comprehensible ☆27 · Updated last month
- ☆43 · Updated last year
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8. ☆45 · Updated 9 months ago
- Custom kernels in Triton language for accelerating LLMs ☆18 · Updated last year
- This repository contains the experimental PyTorch native float8 training UX ☆224 · Updated 9 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆73 · Updated 8 months ago
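Many of the repositories above are collections of Triton kernels. For orientation, here is a minimal sketch of what such a kernel looks like: a vector add following the standard Triton tutorial pattern. The names `add_kernel` and `add` are illustrative only and are not taken from any of the listed projects.

```python
# Minimal Triton vector-add sketch (standard tutorial pattern).
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against out-of-bounds accesses
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Inputs are assumed to be contiguous CUDA tensors of the same shape.
    out = torch.empty_like(x)
    n_elements = out.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out
```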