aredden / torch-bnb-fp4Links

Faster Pytorch bitsandbytes 4bit fp4 nn.Linear ops

☆30

Alternatives and similar repositories for torch-bnb-fp4

Users that are interested in torch-bnb-fp4 are comparing it to the libraries listed below

Sorting:

meta-pytorch / float8_experimental
This repository contains the experimental PyTorch native float8 training UX
☆225Updated last year
jundaf2 / INT8-Flash-Attention-FMHA-Quantization
☆159Updated 2 years ago
IST-DASLab / SparseFinetuning
Repository for Sparse Finetuning of LLMs via modified version of the MosaicML llmfoundry
☆42Updated last year
mgmalek / efficient_cross_entropy
☆121Updated last year
stanford-futuredata / stk
☆113Updated last year
srush / triton-autodiff
Experiment of using Tangent to autodiff triton
☆80Updated last year
chu-tianxiang / QuIP-for-all
QuIP quantization
☆61Updated last year
thu-ml / low-bit-optimizers
Low-bit optimizers for PyTorch
☆132Updated 2 years ago
graphcore-research / out-of-the-box-fp8-training
Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8.
☆46Updated last year
feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆78Updated last year
graphcore-research / unit-scaling
A library for unit scaling in PyTorch
☆132Updated 4 months ago
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆85Updated last year
softmax1 / Flash-Attention-Softmax-N
CUDA and Triton implementations of Flash Attention with SoftmaxN.
☆73Updated last year
fmfi-compbio / admm-pruning
☆30Updated last year
PiotrNawrot / sparse-frontier
The evaluation framework for training-free sparse attention in LLMs
☆104Updated last month
hahnyuan / RPTQ4LLM
Reorder-based post-training quantization for large language model
☆197Updated 2 years ago
haochengxi / Train_Transformers_with_INT4
☆157Updated 2 years ago
hahnyuan / PB-LLM
PB-LLM: Partially Binarized Large Language Models
☆157Updated 2 years ago
epfml / dynamic-sparse-flash-attention
☆150Updated 2 years ago
FasterDecoding / BitDelta
☆203Updated 11 months ago
IST-DASLab / MicroAdam
This repository contains code for the MicroAdam paper.
☆21Updated 11 months ago
nil0x9 / flash-muon
Flash-Muon: An Efficient Implementation of Muon Optimizer
☆206Updated 5 months ago
FasterDecoding / TEAL
☆148Updated 9 months ago
shawntan / scattermoe
Triton-based implementation of Sparse Mixture of Experts.
☆253Updated last month
SqueezeBits / QUICK
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
☆118Updated last year
IST-DASLab / QIGen
Repository for CPU Kernel Generation for LLM Inference
☆27Updated 2 years ago
meta-pytorch / superblock
A block oriented training approach for inference time optimization.
☆33Updated last year
amazon-science / mxfp4-llm
Official implementation for Training LLMs with MXFP4
☆109Updated 7 months ago
tridao / flash-attention-wheels
☆57Updated last year
INT-FlashAttention2024 / INT-FlashAttention
☆83Updated 10 months ago