AlpinDale / QuIP-for-Llama
Code for the paper "QuIP: 2-Bit Quantization of Large Language Models With Guarantees", adapted for Llama models.
☆35 · Updated last year
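For orientation, the sketch below shows the simplest form of what "2-bit quantization" means here: a hypothetical per-channel round-to-nearest scheme in PyTorch. This is only the naive baseline; QuIP's actual method adds adaptive rounding and incoherence processing (random orthogonal transforms), which are omitted here, and the function names are illustrative.

```python
import torch

def quantize_2bit(w: torch.Tensor):
    """Per-output-channel symmetric 2-bit round-to-nearest quantization.

    A simplified baseline only; QuIP layers adaptive rounding and
    incoherence processing on top of something like this.
    """
    qmin, qmax = -2, 1  # signed 2-bit integer grid
    scale = w.abs().amax(dim=1, keepdim=True) / abs(qmin)
    codes = torch.clamp(torch.round(w / scale), qmin, qmax).to(torch.int8)
    return codes, scale

def dequantize_2bit(codes: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return codes.float() * scale

w = torch.randn(4096, 4096)  # a stand-in weight matrix
codes, scale = quantize_2bit(w)
w_hat = dequantize_2bit(codes, scale)
print(f"mean abs reconstruction error: {(w - w_hat).abs().mean().item():.4f}")
```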
Related projects:
- Advanced Ultra-Low Bitrate Compression Techniques for the LLaMA Family of LLMs (☆109 · Updated 8 months ago)
- PB-LLM: Partially Binarized Large Language Models (☆143 · Updated 10 months ago)
- QuIP quantization (☆41 · Updated 6 months ago)
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models (☆185 · Updated 3 weeks ago)
- Code for the paper "QuIP: 2-Bit Quantization of Large Language Models With Guarantees" (☆339 · Updated 6 months ago)
- GPTQLoRA: Efficient Finetuning of Quantized LLMs with GPTQ (☆96 · Updated last year)
- SparseGPT + GPTQ compression of LLMs such as LLaMA, OPT, and Pythia (☆40 · Updated last year)
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models" (☆258 · Updated 10 months ago)
- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization (☆73 · Updated 3 weeks ago)
- A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ and easy export to ONNX/ONNX Runtime (☆141 · Updated 3 weeks ago)
- A toolkit for fine-tuning, inference, and evaluation of GreenBitAI's LLMs (☆68 · Updated 2 months ago)
- Repository for sparse fine-tuning of LLMs via a modified version of the MosaicML llmfoundry (☆36 · Updated 8 months ago)
- An easy-to-use toolkit for weight-only LLM quantization and inference based on the GPTQ algorithm (☆90 · Updated this week)
- Reorder-based post-training quantization for large language models (☆178 · Updated last year)
- Code for the paper "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot", with a LLaMA implementation (☆68 · Updated last year)
- Experiments on speculative sampling with Llama models (☆114 · Updated last year); see the speculative-decoding sketch after this list
- Breaking the Throughput-Latency Trade-off for Long Sequences with Speculative Decoding (☆55 · Updated this week)
- Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention" (☆87 · Updated 8 months ago)
- Fast Inference of MoE Models with CPU-GPU Orchestration (☆163 · Updated 3 months ago)
- KIVI: A Tuning-Free Asymmetric 2-bit Quantization for KV Cache (☆213 · Updated 3 weeks ago)
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" (☆155 · Updated 2 months ago)
- Spherical merging (SLERP) of PyTorch/HF-format language models with minimal feature loss (☆107 · Updated last year); see the SLERP sketch after this list
- Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks (☆123 · Updated 6 months ago)
- Compression for Foundation Models (☆18 · Updated 3 months ago)
- KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization (☆282 · Updated last month)
- FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation (☆44 · Updated 2 months ago)
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs (☆87 · Updated this week)