saic-fi / MobileQuant
[EMNLP Findings 2024] MobileQuant: Mobile-friendly Quantization for On-device Language Models
☆41 · Updated last month
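MobileQuant is a post-training quantization method for running LLMs on mobile hardware; per the paper, it jointly optimizes weight-transformation and activation-range parameters end-to-end. As a point of reference only, below is a minimal, generic sketch of symmetric per-channel weight quantization in PyTorch. It shows the basic scale-and-round operation that the quantization repositories listed further down build on; the function and variable names are hypothetical, and this is not MobileQuant's actual algorithm.

```python
# Illustrative sketch only: generic symmetric per-channel weight quantization.
# NOT MobileQuant's method; names here are hypothetical.
import torch

def quantize_per_channel(w: torch.Tensor, n_bits: int = 8):
    """Round each output channel (row) of a 2-D weight matrix to signed ints."""
    qmax = 2 ** (n_bits - 1) - 1                      # e.g. 127 for INT8
    scale = w.abs().amax(dim=1, keepdim=True) / qmax  # one scale per row
    scale = scale.clamp(min=1e-8)                     # guard against all-zero rows
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Map the integer codes back to floating point for comparison."""
    return q.float() * scale

w = torch.randn(8, 128)
q, scale = quantize_per_channel(w, n_bits=8)
print((w - dequantize(q, scale)).abs().max())  # worst-case reconstruction error
```

Roughly speaking, the methods below (e.g. MobileQuant, EfficientQAT, SignRound) differ mainly in *how* the scales and rounding are chosen — learned end-to-end, trained with quantization-aware training, or optimized by signed gradient descent — rather than in this basic quantize/dequantize step.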
Related projects
Alternatives and complementary repositories for MobileQuant
- A toolkit for fine-tuning, running inference with, and evaluating GreenBitAI's LLMs. ☆73 · Updated 3 weeks ago
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆34 · Updated 8 months ago
- QuIP quantization ☆46 · Updated 7 months ago
- A toolkit that enhances PyTorch with specialized functions for low-bit quantized neural networks. ☆28 · Updated 4 months ago
- Official repository for the paper "NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks". ☆27 · Updated last week
- Fast Inference of MoE Models with CPU-GPU Orchestration ☆170 · Updated 2 weeks ago
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry ☆38 · Updated 9 months ago
- Repository for CPU Kernel Generation for LLM Inference ☆24 · Updated last year
- Pruner-Zero: Evolving Symbolic Pruning Metric from Scratch for LLMs ☆71 · Updated 5 months ago
- Advanced Quantization Algorithm for LLMs. This is the official implementation of "Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs". ☆245 · Updated this week
- Breaking the Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆70 · Updated last week
- Advanced Ultra-Low Bitrate Compression Techniques for the LLaMA Family of LLMs ☆111 · Updated 10 months ago
- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization ☆85 · Updated 3 weeks ago
- EE-LLM is a framework for large-scale training and inference of early-exit (EE) large language models (LLMs). ☆47 · Updated 5 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆46 · Updated this week
- Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention" ☆91 · Updated last month
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆184 · Updated last month
- Unofficial implementations of block/layer-wise pruning methods for LLMs. ☆49 · Updated 6 months ago
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆222 · Updated last month
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆51 · Updated 2 months ago
- KV cache compression for high-throughput LLM inference ☆82 · Updated last week
- FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation ☆45 · Updated 4 months ago
- Code for the paper "QuIP: 2-Bit Quantization of Large Language Models With Guarantees", adapted for Llama models ☆36 · Updated last year
- PB-LLM: Partially Binarized Large Language Models ☆146 · Updated 11 months ago