GreenBitAI / low_bit_llamaLinks

Advanced Ultra-Low Bitrate Compression Techniques for the LLaMA Family of LLMs

☆110

Alternatives and similar repositories for low_bit_llama

Users that are interested in low_bit_llama are comparing it to the libraries listed below

Sorting:

hahnyuan / PB-LLM
PB-LLM: Partially Binarized Large Language Models
☆153Updated last year
chu-tianxiang / QuIP-for-all
QuIP quantization
☆54Updated last year
IST-DASLab / QIGen
Repository for CPU Kernel Generation for LLM Inference
☆26Updated 2 years ago
IST-DASLab / SparseFinetuning
Repository for Sparse Finetuning of LLMs via modified version of the MosaicML llmfoundry
☆42Updated last year
IST-DASLab / qmoe
Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models".
☆277Updated last year
hahnyuan / RPTQ4LLM
Reorder-based post-training quantization for large language model
☆193Updated 2 years ago
FasterDecoding / BitDelta
☆199Updated 8 months ago
qwopqwop200 / gptqlora
GPTQLoRA: Efficient Finetuning of Quantized LLMs with GPTQ
☆103Updated 2 years ago
snu-mllab / GuidedQuant
Official PyTorch implementation of "GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance" (ICML 2025)
☆39Updated last month
AlpinDale / QuIP-for-Llama
Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees" adapted for Llama models
☆38Updated 2 years ago
haochengxi / Train_Transformers_with_INT4
☆153Updated 2 years ago
softmax1 / Flash-Attention-Softmax-N
CUDA and Triton implementations of Flash Attention with SoftmaxN.
☆71Updated last year
HanGuo97 / lq-lora
☆127Updated last year
wejoncy / QLLM
A general 2-8 bits quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, and export to onnx/onnx-runtime easily.
☆175Updated 4 months ago
astramind-ai / BitMat
An efficent implementation of the method proposed in "The Era of 1-bit LLMs"
☆154Updated 9 months ago
IST-DASLab / Quartet
☆75Updated last month
NolanoOrg / sparse_quant_llms
SparseGPT + GPTQ Compression of LLMs like LLaMa, OPT, Pythia
☆41Updated 2 years ago
horseee / LLaMA-Pruning
Structural Pruning for LLaMA
☆54Updated 2 years ago
Cornell-RelaxML / QuIP
Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees"
☆376Updated last year
BlinkDL / modded-nanogpt-rwkv
RWKV-7: Surpassing GPT
☆94Updated 8 months ago
samchaineau / llm_slerp_generation
Repo hosting codes and materials related to speeding LLMs' inference using token merging.
☆36Updated 2 weeks ago
RobertCsordas / moe_attention
Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention"
☆98Updated 10 months ago
tridao / flash-attention-wheels
☆52Updated last year
mlc-ai / llm-perf-bench
☆120Updated last year
minyoungg / LTE
☆68Updated last year
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆80Updated 11 months ago
GATECH-EIC / ShiftAddLLM
ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization
☆109Updated 9 months ago
dust-tt / llama-ssp
Experiments on speculative sampling with Llama models
☆128Updated 2 years ago
IST-DASLab / QuEST
Work in progress.
☆70Updated last month
Cornell-RelaxML / qtip
☆145Updated last month