pytorch-labs / superblock
A block-oriented training approach for inference-time optimization.
☆ 29 · updated 2 months ago
Related projects
Alternatives and complementary repositories for superblock
- This repository contains the experimental PyTorch-native float8 training UX · ☆ 211 · updated 3 months ago
- An experiment in using Tangent to autodiff Triton · ☆ 71 · updated 9 months ago
- A library for unit scaling in PyTorch · ☆ 105 · updated this week
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry · ☆ 38 · updated 9 months ago
- Simple and fast low-bit matmul kernels in CUDA/Triton · ☆ 137 · updated this week
- CUDA and Triton implementations of Flash Attention with SoftmaxN · ☆ 66 · updated 5 months ago
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8 · ☆ 35 · updated 3 months ago
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" · ☆ 56 · updated 3 weeks ago
- An extensible collectives library in Triton · ☆ 63 · updated last month
- FlexAttention with FlashAttention3 support · ☆ 26 · updated last month
- Applied AI experiments and examples for PyTorch · ☆ 160 · updated last week
- CUDA implementation of autoregressive linear attention, with all the latest research findings · ☆ 43 · updated last year
- Memory Optimizations for Deep Learning (ICML 2023) · ☆ 59 · updated 7 months ago
- Megatron's multi-modal data loader · ☆ 130 · updated this week
- Boosting 4-bit inference kernels with 2:4 sparsity · ☆ 51 · updated 2 months ago
- Repository for CPU Kernel Generation for LLM Inference · ☆ 24 · updated last year
- Cataloging released Triton kernels · ☆ 132 · updated 2 months ago
- Triton-based implementation of Sparse Mixture of Experts · ☆ 184 · updated last month
- A collection of kernels written in the Triton language (a minimal Triton kernel sketch appears after this list) · ☆ 63 · updated last week
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs · ☆ 184 · updated last month
- Using FlexAttention to compute attention with different masking patterns · ☆ 40 · updated last month
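The last item above is about expressing masking patterns with FlexAttention. As a minimal sketch of that style of API (assuming PyTorch ≥ 2.5, which ships `torch.nn.attention.flex_attention`; this is illustrative only, not code from the listed repository), a causal pattern can be written as a mask function and compiled into a block mask:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# mask_mod: return True for (query, key) index pairs that are allowed to attend.
def causal_mask(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

B, H, S, D = 2, 8, 1024, 64
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) for _ in range(3))

# Pre-compute a block-sparse mask so fully masked tiles can be skipped.
block_mask = create_block_mask(causal_mask, B=None, H=None, Q_LEN=S, KV_LEN=S)

out = flex_attention(q, k, v, block_mask=block_mask)
```

Other patterns (sliding-window, prefix-LM, document masking) follow the same shape: only the mask function changes.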
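Several entries in the list are collections of kernels written in the Triton language. As a generic illustration of what such a kernel looks like (a standard element-wise add, not code from any of the listed repositories), each program instance processes one block of elements with a mask guarding the final partial block:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # One program instance handles one BLOCK_SIZE-wide slice of the tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the last, possibly partial block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```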