pytorch-labs / superblock

A block-oriented training approach for inference-time optimization.

☆33

Alternatives and similar repositories for superblock

Users who are interested in superblock are comparing it to the libraries listed below.

pytorch-labs / float8_experimental
This repository contains the experimental PyTorch native float8 training UX
☆224 Updated last year
jundaf2 / INT8-Flash-Attention-FMHA-Quantization
☆158 Updated last year
facebookresearch / MODel_opt
Memory Optimizations for Deep Learning (ICML 2023)
☆102 Updated last year
srush / triton-autodiff
Experiment of using Tangent to autodiff triton
☆80 Updated last year
SqueezeBits / QUICK
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
☆118 Updated last year
cchan / tccl
extensible collectives library in triton
☆88 Updated 4 months ago
pytorch-labs / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆199 Updated this week
graphcore-research / unit-scaling
A library for unit scaling in PyTorch
☆128 Updated 3 weeks ago
aredden / torch-bnb-fp4
Faster Pytorch bitsandbytes 4bit fp4 nn.Linear ops
☆30 Updated last year
graphcore-research / out-of-the-box-fp8-training
Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8.
☆46 Updated last year
IST-DASLab / SparseFinetuning
Repository for Sparse Finetuning of LLMs via modified version of the MosaicML llmfoundry
☆42 Updated last year
Dao-AILab / fast-hadamard-transform
Fast Hadamard transform in CUDA, with a PyTorch interface
☆215 Updated last year
mit-han-lab / patch_conv
Patch convolution to avoid large GPU memory usage of Conv2D
☆92 Updated 6 months ago
INT-FlashAttention2024 / INT-FlashAttention
☆80 Updated 6 months ago
nil0x9 / flash-muon
Flash-Muon: An Efficient Implementation of Muon Optimizer
☆152 Updated last month
mobiusml / gemlite
Fast low-bit matmul kernels in Triton
☆339 Updated this week
IST-DASLab / QIGen
Repository for CPU Kernel Generation for LLM Inference
☆26 Updated 2 years ago
mgmalek / efficient_cross_entropy
☆114 Updated last year
GindaChen / FlexFlashAttention3
FlexAttention w/ FlashAttention3 Support
☆27 Updated 10 months ago
stanford-futuredata / stk
☆107 Updated 11 months ago
gpu-mode / ring-attention
ring-attention experiments
☆146 Updated 9 months ago
facebookresearch / any4
Quantize transformers to any learned arbitrary 4-bit numeric format
☆39 Updated 3 weeks ago
haochengxi / Train_Transformers_with_INT4
☆154 Updated 2 years ago
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆80 Updated 11 months ago
neuralmagic / compressed-tensors
A safetensors extension to efficiently store sparse quantized tensors on disk
☆142 Updated this week
open-lm-engine / flash-model-architectures
A bunch of kernels that might make stuff slower 😉
☆56 Updated last week
hahnyuan / PB-LLM
PB-LLM: Partially Binarized Large Language Models
☆153 Updated last year
Qualcomm-AI-research / FP8-quantization
☆154 Updated 2 years ago
google / aqt
☆323 Updated last week
tridao / flash-attention-wheels
☆53 Updated last year