pytorch-labs / superblock
A block-oriented training approach for inference-time optimization.
☆33 · Updated 10 months ago
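For orientation, below is a minimal, illustrative sketch of the block-sparse masking idea that this kind of training builds on: learn a score per weight block, keep the top-scoring blocks, and zero out the rest so inference kernels can skip the pruned blocks. This is a generic example, not superblock's actual API; `BlockSparseLinear` and all of its names and parameters are hypothetical.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockSparseLinear(nn.Module):
    """Hypothetical sketch: a Linear layer whose weight is masked blockwise.

    One learnable score per (block_size x block_size) weight block; the
    top-scoring fraction of blocks is kept and the rest are zeroed.
    """

    def __init__(self, in_features, out_features, block_size=16, sparsity=0.5):
        super().__init__()
        assert in_features % block_size == 0 and out_features % block_size == 0
        self.block_size, self.sparsity = block_size, sparsity
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        # One learnable score per weight block.
        self.scores = nn.Parameter(
            torch.rand(out_features // block_size, in_features // block_size)
        )

    def block_mask(self):
        n = self.scores.numel()
        keep = max(1, int(n * (1 - self.sparsity)))
        # Threshold at the keep-th largest score; blocks at or above it survive.
        threshold = self.scores.flatten().kthvalue(n - keep + 1).values
        hard = (self.scores >= threshold).float()
        # Straight-through estimator: forward uses the hard 0/1 mask,
        # backward passes gradients to the underlying scores.
        mask = hard + self.scores - self.scores.detach()
        # Expand the per-block mask to the full weight shape.
        mask = mask.repeat_interleave(self.block_size, 0)
        return mask.repeat_interleave(self.block_size, 1)

    def forward(self, x):
        return F.linear(x, self.weight * self.block_mask())

layer = BlockSparseLinear(64, 64, block_size=16, sparsity=0.5)
out = layer(torch.randn(2, 64))  # trains like a dense layer, masked blockwise
```

The point of masking whole blocks rather than individual weights is that, at inference time, a block-sparse kernel can skip entire tiles, which maps far better onto GPU hardware than unstructured sparsity.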
Alternatives and similar repositories for superblock
Users interested in superblock are comparing it to the libraries listed below.
- This repository contains the experimental PyTorch-native float8 training UX ☆224 · Updated 10 months ago
- JaxPP is a library for JAX that enables flexible MPMD pipeline parallelism for large-scale LLM training ☆49 · Updated last month
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆167 · Updated this week
- Work in progress. ☆69 · Updated 2 weeks ago
- Extensible collectives library in Triton ☆86 · Updated 2 months ago
- Patch convolution to avoid large GPU memory usage of Conv2D ☆88 · Updated 5 months ago
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆118 · Updated last year
- Memory Optimizations for Deep Learning (ICML 2023) ☆64 · Updated last year
- Faster PyTorch bitsandbytes 4-bit fp4 nn.Linear ops ☆30 · Updated last year
- pytorch-profiler ☆51 · Updated 2 years ago
- A library for unit scaling in PyTorch ☆125 · Updated 6 months ago
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry ☆42 · Updated last year
- An experiment in using Tangent to autodiff Triton ☆79 · Updated last year
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8. ☆45 · Updated 11 months ago
- Fast Hadamard transform in CUDA, with a PyTorch interface (a pure-PyTorch sketch of the transform follows after this list) ☆201 · Updated last year
- Flexible simulator for mixed-precision and format simulation of LLMs and vision transformers. ☆50 · Updated last year
- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization ☆109 · Updated 8 months ago
- Code implementation of GPTAQ (https://arxiv.org/abs/2504.02692) ☆47 · Updated 3 weeks ago
- 🎬 3.7× faster video generation E2E 🖼️ 1.6× faster image generation E2E ⚡ ColumnSparseAttn 9.3× vs FlashAttn‑3 💨 ColumnSparseGEMM 2.5× … ☆74 · Updated last week
- Evaluation code repository for the paper "ModuLoRA: Finetuning 3-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers" (2023… ☆13 · Updated last year
- Dynamic Neural Architecture Search Toolkit ☆30 · Updated 6 months ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆131 · Updated last week
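As referenced in the fast Hadamard transform entry above, here is a pure-PyTorch fast Walsh–Hadamard transform for context. It computes the same unnormalized transform that a dedicated CUDA kernel would, in O(n log n) butterfly stages; this is an independent illustration, not that repository's code or interface, and `fwht` is a hypothetical name.

```python
import torch

def fwht(x: torch.Tensor) -> torch.Tensor:
    """Unnormalized fast Walsh-Hadamard transform over the last dimension.

    Pure-PyTorch reference, O(n log n); the last dimension must be a power
    of two. A fused CUDA kernel performs the same butterfly stages with far
    better memory bandwidth.
    """
    n = x.shape[-1]
    assert n & (n - 1) == 0, "last dimension must be a power of two"
    batch = x.shape[:-1]
    y = x.clone()
    h = 1
    while h < n:
        # Pair elements h apart and apply the 2x2 butterfly: (a + b, a - b).
        y = y.reshape(*batch, n // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        y = torch.stack((a + b, a - b), dim=-2).reshape(*batch, n)
        h *= 2
    return y

# Sanity check against the explicit (symmetric) Sylvester Hadamard matrix, n = 8.
H = torch.tensor([[1.0]])
for _ in range(3):
    H = torch.cat((torch.cat((H, H), 1), torch.cat((H, -H), 1)), 0)
x = torch.randn(4, 8)
assert torch.allclose(fwht(x), x @ H, atol=1e-4)
```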