tspeterkim / mixed-precision-from-scratch
Mixed precision training from scratch with Tensors and CUDA
☆23 · Updated last year
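For context on what the repository implements, here is a minimal sketch of the mixed-precision training loop it refers to: fp32 master weights, fp16 compute, and manual loss scaling to keep small gradients from underflowing. This is a generic illustration in plain PyTorch under assumed shapes and hyperparameters, not code taken from the repository.

```python
# Minimal mixed-precision sketch: fp32 master weights, fp16 compute, fixed loss scale.
# All tensor shapes, the loss scale, and the learning rate are illustrative assumptions.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

w_master = torch.randn(1024, 1024, device=device)                      # fp32 master weight
x = torch.randn(64, 1024, device=device, dtype=torch.float16)          # fp16 activations
target = torch.randn(64, 1024, device=device, dtype=torch.float16)

loss_scale = 1024.0
lr = 1e-3

for step in range(10):
    w_half = w_master.half().requires_grad_()                          # fp16 working copy
    out = x @ w_half                                                    # fp16 matmul (Tensor Cores)
    loss = torch.nn.functional.mse_loss(out.float(), target.float())   # reduce the loss in fp32
    (loss * loss_scale).backward()                                      # scaled backward pass in fp16
    grad = w_half.grad.float() / loss_scale                             # unscale gradient into fp32
    if torch.isfinite(grad).all():                                      # skip the step on inf/nan
        w_master -= lr * grad                                           # fp32 master-weight update
```

In practice, `torch.autocast` and `torch.cuda.amp.GradScaler` automate the casting and dynamic loss scaling; the point of a from-scratch version is to expose these steps explicitly.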
Alternatives and similar repositories for mixed-precision-from-scratch
Users interested in mixed-precision-from-scratch are comparing it to the repositories listed below
- Load compute kernels from the Hub · ☆144 · Updated this week
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. · ☆62 · Updated 4 months ago
- ring-attention experiments · ☆145 · Updated 7 months ago
- ☆157 · Updated last year
- CUDA and Triton implementations of Flash Attention with SoftmaxN. · ☆70 · Updated last year
- Collection of kernels written in the Triton language · ☆125 · Updated 2 months ago
- ☆108 · Updated last year
- Learn CUDA with PyTorch · ☆21 · Updated this week
- Experiment of using Tangent to autodiff triton · ☆79 · Updated last year
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS … · ☆59 · Updated 7 months ago
- ☆93 · Updated last week
- ☆46 · Updated last week
- ☆215 · Updated this week
- ☆88 · Updated last year
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. · ☆127 · Updated this week
- Boosting 4-bit inference kernels with 2:4 Sparsity · ☆76 · Updated 9 months ago
- Cataloging released Triton kernels. · ☆229 · Updated 4 months ago
- This repository contains the experimental PyTorch native float8 training UX · ☆223 · Updated 10 months ago
- extensible collectives library in triton · ☆87 · Updated 2 months ago
- ☆32 · Updated 2 months ago
- ☆169 · Updated 5 months ago
- A bunch of kernels that might make stuff slower 😉 · ☆48 · Updated this week
- A minimal implementation of vllm. · ☆41 · Updated 10 months ago
- ☆78 · Updated 11 months ago
- Make triton easier · ☆47 · Updated 11 months ago
- Triton-based implementation of Sparse Mixture of Experts. · ☆217 · Updated 6 months ago
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand · ☆185 · Updated this week
- a minimal cache manager for PagedAttention, on top of llama3. · ☆89 · Updated 9 months ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer · ☆121 · Updated last week
- High-Performance SGEMM on CUDA devices · ☆94 · Updated 4 months ago