GreenBitAI / low_bit_llama
Advanced Ultra-Low Bitrate Compression Techniques for the LLaMA Family of LLMs
☆109 · Updated 8 months ago
Related projects:
- PB-LLM: Partially Binarized Large Language Models ☆143 · Updated 10 months ago
- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization ☆73 · Updated 3 weeks ago
- Reorder-based post-training quantization for large language models ☆178 · Updated last year
- Code for the paper "QuIP: 2-Bit Quantization of Large Language Models With Guarantees", adapted for LLaMA models ☆35 · Updated last year
- A toolkit for fine-tuning, inference, and evaluation of GreenBitAI's LLMs ☆68 · Updated 2 months ago
- Code for QuaRot, end-to-end 4-bit inference of large language models ☆256 · Updated last month
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆156 · Updated this week
- KIVI: A Tuning-Free Asymmetric 2-bit Quantization for KV Cache ☆213 · Updated 3 weeks ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLMs ☆134 · Updated 2 months ago
- Repository for sparse fine-tuning of LLMs via a modified version of the MosaicML llmfoundry ☆36 · Updated 8 months ago
- QuIP quantization ☆41 · Updated 6 months ago
- Fast Inference of MoE Models with CPU-GPU Orchestration ☆163 · Updated 3 months ago
- A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ, with easy export to onnx/onnx-runtime ☆141 · Updated 3 weeks ago
- Unofficial implementation of the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆123 · Updated 3 months ago
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models" ☆258 · Updated 10 months ago
- The official implementation of the EMNLP 2023 paper LLM-FP4 ☆156 · Updated 9 months ago
- GPTQLoRA: Efficient Finetuning of Quantized LLMs with GPTQ ☆96 · Updated last year
- Breaking the Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆55 · Updated this week
- SparseGPT + GPTQ compression of LLMs like LLaMA, OPT, Pythia ☆40 · Updated last year
- Simple implementation of Speculative Sampling in NumPy for GPT-2 ☆87 · Updated last year
- KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ☆282 · Updated last month
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving ☆258 · Updated 2 months ago
- (ICML 2024) BiLLM: Pushing the Limit of Post-Training Quantization for LLMs ☆175 · Updated 3 months ago
- EE-LLM: a framework for large-scale training and inference of early-exit (EE) large language models (LLMs) ☆44 · Updated 3 months ago
- Simple and fast low-bit matmul kernels in CUDA ☆48 · Updated this week
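Most of the projects above revolve around post-training weight quantization at very low bit widths. As a generic illustration of the basic idea (a round-to-nearest baseline, not code from any of the listed repositories), symmetric 4-bit quantization of a weight matrix with one scale per output row can be sketched in NumPy:

```python
import numpy as np

def quantize_rtn(w: np.ndarray, bits: int = 4):
    """Symmetric round-to-nearest quantization with a per-row scale.

    Illustrative baseline only; real methods (GPTQ, QuIP, BiLLM, ...)
    improve on this by minimizing layer output error, not weight error.
    """
    qmax = 2 ** (bits - 1) - 1                 # e.g. 7 for signed 4-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)   # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, s = quantize_rtn(w, bits=4)
w_hat = dequantize(q, s)
err = np.abs(w - w_hat).max()  # bounded by 0.5 * scale per row
```

With per-row scales, the worst-case reconstruction error of each weight is half a quantization step (0.5 × scale), which is why going from 4-bit to 2-bit roughly quadruples the step size and makes plain round-to-nearest much lossier.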