pytorch-labs / superblock
A block-oriented training approach for inference-time optimization.
☆33 Updated 9 months ago
Alternatives and similar repositories for superblock
Users who are interested in superblock are comparing it to the repositories listed below.
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆127 Updated this week
- ☆157 Updated last year
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8. ☆44 Updated 10 months ago
- This repository contains the experimental PyTorch native float8 training UX ☆223 Updated 10 months ago
- ☆108 Updated last year
- Work in progress. ☆67 Updated last week
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆121 Updated last week
- Experiment of using Tangent to autodiff Triton ☆79 Updated last year
- A library for unit scaling in PyTorch ☆125 Updated 6 months ago
- Repository for CPU Kernel Generation for LLM Inference ☆26 Updated last year
- ☆105 Updated 9 months ago
- Extensible collectives library in Triton ☆87 Updated 2 months ago
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry ☆42 Updated last year
- Memory Optimizations for Deep Learning (ICML 2023) ☆64 Updated last year
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆118 Updated last year
- ViT inference in Triton because, why not? ☆28 Updated last year
- ☆49 Updated 10 months ago
- ☆46 Updated last week
- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization ☆108 Updated 7 months ago
- ☆39 Updated 7 months ago
- ☆21 Updated 3 months ago
- Patch convolution to avoid large GPU memory usage of Conv2D ☆87 Updated 4 months ago
- ☆130 Updated 2 months ago
- Load compute kernels from the Hub ☆139 Updated last week
- ☆73 Updated 4 months ago
- ☆71 Updated 2 months ago
- Fast low-bit matmul kernels in Triton ☆311 Updated this week
- Prototype routines for GPU quantization written using PyTorch. ☆21 Updated 3 months ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆93 Updated last week
- Faster PyTorch bitsandbytes 4bit fp4 nn.Linear ops ☆28 Updated last year