facebookresearch / any4

Quantize transformers to any learned arbitrary 4-bit numeric format

☆48

Alternatives and similar repositories for any4

Users who are interested in any4 compare it to the libraries listed below.

microsoft / AttentionEngine
☆102 Updated 5 months ago
IBM / triton-dejavu
Framework to reduce autotune overhead to zero for well known deployments.
☆84 Updated last month
deepspeedai / DeepSpeed-Kernels
☆71 Updated 7 months ago
yuzhenmao / IceFormer
Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024).
☆25 Updated 3 months ago
IST-DASLab / SparseFinetuning
Repository for Sparse Finetuning of LLMs via modified version of the MosaicML llmfoundry
☆42 Updated last year
INT-FlashAttention2024 / INT-FlashAttention
☆82 Updated 9 months ago
IST-DASLab / FP-Quant
☆58 Updated this week
Qualcomm-AI-research / gptvq
☆36 Updated last year
jundaf2 / INT8-Flash-Attention-FMHA-Quantization
☆158 Updated 2 years ago
TianjinYellow / StableSPAM
☆25 Updated 7 months ago
feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆78 Updated last year
IST-DASLab / MicroAdam
This repository contains code for the MicroAdam paper.
☆20 Updated 10 months ago
li-plus / flash-preference
Accelerate LLM preference tuning via prefix sharing with a single line of code
☆52 Updated 3 months ago
ScalingIntelligence / CATS
☆28 Updated 11 months ago
stanford-futuredata / stk
☆112 Updated last year
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆84 Updated last year
IST-DASLab / qutlass
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
☆120 Updated last week
IST-DASLab / Quartet
☆103 Updated this week
feifeibear / ChituAttention
Quantized Attention on GPU
☆44 Updated 11 months ago
vedantroy / gpu_kernels
☆27 Updated last year
tridao / flash-attention-wheels
☆57 Updated last year
Qualcomm-AI-research / lr-qat
☆46 Updated 11 months ago
rayleizhu / vllm-ra
[ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts
☆40 Updated last year
tile-ai / tvm
Open deep learning compiler stack for cpu, gpu and specialized accelerators
☆19 Updated this week
FasterDecoding / TEAL
☆145 Updated 8 months ago
dame-cell / Triformer
Transformers components but in Triton
☆34 Updated 5 months ago
meta-pytorch / kraken
Triton-based Symmetric Memory operators and examples
☆48 Updated 2 weeks ago
SqueezeBits / QUICK
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
☆118 Updated last year
vllm-project / compressed-tensors
A safetensors extension to efficiently store sparse quantized tensors on disk
☆183 Updated this week
GindaChen / FlexFlashAttention3
FlexAttention w/ FlashAttention3 Support
☆27 Updated last year