NolanoOrg / sparse_quant_llms
SparseGPT + GPTQ Compression of LLMs like LLaMa, OPT, Pythia
☆41 Updated last year
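For orientation, the sketch below illustrates the layer-wise objective that SparseGPT- and GPTQ-style methods optimize: choose compressed weights Ŵ that keep a layer's outputs on calibration data close to the original, i.e. minimize ||WX − ŴX||². This is not code from this repository; it only shows the objective together with a naive round-to-nearest baseline, whereas GPTQ/SparseGPT additionally compensate the quantization or pruning error column by column using second-order (Hessian) information. All names and shapes below are illustrative assumptions.

```python
# Illustrative sketch only (not this repository's code): the layer-wise
# objective behind GPTQ/SparseGPT-style compression is to pick compressed
# weights W_hat minimizing ||W X - W_hat X||_F^2 on calibration activations X.
import numpy as np

def quantize_rtn(w, n_bits=4):
    """Naive symmetric round-to-nearest quantization (the baseline GPTQ improves on)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).clip(-qmax, qmax) * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 64))    # one linear layer's weights (out_features x in_features)
X = rng.standard_normal((64, 256))   # calibration activations (in_features x n_samples)

W_hat = quantize_rtn(W, n_bits=4)
error = float(np.linalg.norm(W @ X - W_hat @ X) ** 2)
print(f"4-bit RTN layer-wise reconstruction error: {error:.2f}")
```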
Related projects
Alternatives and complementary repositories for sparse_quant_llms
- QuIP quantization ☆46 Updated 8 months ago
- Demonstration that finetuning a RoPE model on sequences longer than those seen during pre-training extends the model's context limit ☆63 Updated last year
- GPTQLoRA: Efficient Finetuning of Quantized LLMs with GPTQ ☆97 Updated last year
- Implementation of "LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models" ☆42 Updated last week
- ☆122 Updated 9 months ago
- ☆49 Updated 8 months ago
- Official implementation for 'Extending LLMs' Context Window with 100 Samples' ☆74 Updated 10 months ago
- Repository for CPU Kernel Generation for LLM Inference ☆25 Updated last year
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry ☆38 Updated 10 months ago
- Spherical merging of PyTorch/HF-format language models with minimal feature loss ☆112 Updated last year
- My implementation of "Q-Sparse: All Large Language Models can be Fully Sparsely-Activated" ☆30 Updated 3 months ago
- ☆96 Updated last month
- Exploring finetuning public checkpoints on filtered 8K-token sequences from the Pile ☆115 Updated last year
- Reorder-based post-training quantization for large language models ☆181 Updated last year
- Experiments on speculative sampling with Llama models ☆118 Updated last year
- Efficient 3-bit/4-bit quantization of LLaMA models ☆19 Updated last year
- Code repository for the c-BTM paper ☆105 Updated last year
- ☆35 Updated 3 weeks ago
- KV cache compression for high-throughput LLM inference ☆87 Updated this week
- Advanced Ultra-Low Bitrate Compression Techniques for the LLaMA Family of LLMs ☆111 Updated 10 months ago
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" ☆56 Updated last month
- Here we will test various linear attention designs ☆56 Updated 6 months ago
- Parameter-Efficient Sparsity Crafting From Dense to Mixture-of-Experts for Instruction Tuning on General Tasks ☆129 Updated 2 months ago
- Simple implementation of Speculative Sampling in NumPy for GPT-2 (a toy sketch of the sampling rule follows this list) ☆89 Updated last year
- Code for the paper "QuIP: 2-Bit Quantization of Large Language Models With Guarantees", adapted for Llama models ☆36 Updated last year
- ☆44 Updated 11 months ago
- Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by DeepMind ☆82 Updated 8 months ago
- Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention" ☆92 Updated last month
- A toolkit for fine-tuning, running inference with, and evaluating GreenBitAI's LLMs ☆74 Updated last month
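Several of the projects above (the NumPy GPT-2 demo, the Llama experiments, and the implementation of the DeepMind paper) center on speculative sampling. As a reading aid, here is a minimal, self-contained sketch of the accept/reject rule from "Accelerating Large Language Model Decoding with Speculative Sampling": a cheap draft model proposes k tokens, the target model accepts each with probability min(1, p/q), and the first rejected position is resampled from the normalized residual max(p − q, 0). The toy distributions and names (`draft_probs`, `target_probs`, `speculative_step`) are hypothetical stand-ins, not code from any listed repository.

```python
# Toy sketch of speculative sampling's accept/reject rule; the "models" below
# are fixed distributions standing in for a real draft and target model.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8

def draft_probs(prefix):
    # hypothetical cheap draft model: a fixed, slightly peaked distribution
    p = np.ones(VOCAB)
    p[0] = 3.0
    return p / p.sum()

def target_probs(prefix):
    # hypothetical expensive target model: prefers a different token
    p = np.ones(VOCAB)
    p[1] = 4.0
    return p / p.sum()

def speculative_step(prefix, k=4):
    """Draft k tokens cheaply, then accept or resample them against the target."""
    drafted = []
    for _ in range(k):
        q = draft_probs(prefix + drafted)
        drafted.append(int(rng.choice(VOCAB, p=q)))
    accepted = []
    for i, tok in enumerate(drafted):
        q = draft_probs(prefix + drafted[:i])
        p = target_probs(prefix + drafted[:i])
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)                # accept the drafted token
        else:
            residual = np.maximum(p - q, 0.0)   # reject: resample from the residual
            accepted.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
            return prefix + accepted            # stop at the first rejection
    # every draft accepted: take one bonus token from the target model
    p = target_probs(prefix + accepted)
    accepted.append(int(rng.choice(VOCAB, p=p)))
    return prefix + accepted

print(speculative_step([], k=4))
```

In practice p and q come from a large target model and a small draft model evaluated on the same prefix; the toy version only illustrates the acceptance logic, which is what the listed implementations build on.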