Cornell-RelaxML / qtip
☆60 · Updated last week
Related projects
Alternatives and complementary repositories for qtip
- ☆95 · Updated last month
- QuIP quantization ☆46 · Updated 7 months ago
- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization ☆85 · Updated 3 weeks ago
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆184 · Updated last month
- An algorithm for static activation quantization of LLMs ☆67 · Updated this week
- KV cache compression for high-throughput LLM inference ☆82 · Updated last week
- Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆70 · Updated last week
- PB-LLM: Partially Binarized Large Language Models ☆146 · Updated 11 months ago
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆222 · Updated last month
- Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for LLMs ☆71 · Updated 4 months ago
- Model Compression Toolbox for Large Language Models and Diffusion Models ☆161 · Updated this week
- Code for Palu: Compressing KV-Cache with Low-Rank Projection ☆54 · Updated this week
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆46 · Updated this week
- Code for the NeurIPS 2024 paper QuaRot: end-to-end 4-bit inference of large language models ☆281 · Updated 3 months ago
- Simple and fast low-bit matmul kernels in CUDA / Triton ☆137 · Updated this week
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" ☆154 · Updated 3 weeks ago
- Fast Inference of MoE Models with CPU-GPU Orchestration ☆170 · Updated last week
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆146 · Updated 4 months ago
- ☆182 · Updated 3 weeks ago
- A toolkit for fine-tuning, inferencing, and evaluating GreenBitAI's LLMs ☆73 · Updated 3 weeks ago
- Code repo for the paper "SpinQuant: LLM Quantization with Learned Rotations" ☆151 · Updated this week
- Fast Hadamard transform in CUDA, with a PyTorch interface (a pure-PyTorch sketch of the underlying butterfly appears after this list) ☆108 · Updated 5 months ago
- KIVI: A Tuning-Free Asymmetric 2-bit Quantization for KV Cache (a sketch of the asymmetric round-to-nearest baseline appears after this list) ☆240 · Updated last month
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients ☆172 · Updated 3 months ago
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆104 · Updated last month
- A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ and easy export to ONNX/ONNX Runtime ☆148 · Updated last month
- Activation-aware Singular Value Decomposition for Compressing Large Language Models ☆49 · Updated 3 weeks ago
- Code for the paper "QuIP: 2-Bit Quantization of Large Language Models With Guarantees", adapted for Llama models ☆36 · Updated last year
- Official PyTorch implementation of FlatQuant: Flatness Matters for LLM Quantization ☆58 · Updated this week
- ☆46 · Updated last month
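
Several repositories in this list (QuaRot, SpinQuant, FlatQuant, and the fast Hadamard transform kernels) rely on the same primitive: rotating weights and activations with an orthogonal Hadamard matrix so outliers are flattened before low-bit quantization. As a rough illustration, here is a minimal pure-PyTorch sketch of the Walsh-Hadamard butterfly those CUDA kernels accelerate; the function name and orthonormal scaling are illustrative choices, not any repo's actual API.

```python
import torch

def hadamard_transform(x: torch.Tensor) -> torch.Tensor:
    """Orthonormal Walsh-Hadamard transform along the last dimension.

    The last dim must be a power of two; the butterfly costs O(n log n)
    versus O(n^2) for an explicit matrix multiply.
    """
    shape = x.shape
    n = shape[-1]
    assert n > 0 and n & (n - 1) == 0, "last dim must be a power of two"
    h = 1
    while h < n:
        # View the last dim as blocks of 2*h, split each block into halves a, b.
        y = x.reshape(*shape[:-1], n // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        # One butterfly stage: (a, b) -> (a + b, a - b).
        x = torch.stack((a + b, a - b), dim=-2).reshape(shape)
        h *= 2
    return x * n ** -0.5  # scale so the implied H is orthonormal

# Because H is orthogonal, rotating both operands leaves a matmul invariant,
# which is why these repos can insert rotations without changing model output:
x = torch.randn(2, 1024)
w = torch.randn(512, 1024)
torch.testing.assert_close(
    hadamard_transform(x) @ hadamard_transform(w).T,
    x @ w.T,
    rtol=1e-3, atol=1e-3,
)
```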
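
Likewise, the KV-cache quantizers above (KIVI, GEAR) start from asymmetric round-to-nearest quantization with per-group scales and zero points. The sketch below shows that baseline only, under assumed names (`quantize_asym`, `dequantize_asym`) and an assumed group size; the actual repos add dimension-aware grouping, bit packing, and fused kernels.

```python
import torch

def quantize_asym(x: torch.Tensor, bits: int = 2, group_size: int = 64):
    """Asymmetric round-to-nearest quantization with per-group scale/zero-point."""
    qmax = 2 ** bits - 1
    g = x.reshape(-1, group_size)  # one (scale, zero) pair per group
    lo = g.min(dim=1, keepdim=True).values
    hi = g.max(dim=1, keepdim=True).values
    scale = (hi - lo).clamp(min=1e-8) / qmax   # map [lo, hi] onto [0, qmax]
    zero = (-lo / scale).round()               # integer offset for the minimum
    q = (g / scale + zero).round().clamp(0, qmax).to(torch.uint8)
    return q, scale, zero

def dequantize_asym(q, scale, zero, shape):
    return ((q.float() - zero) * scale).reshape(shape)

# Example: quantize a slice of cached keys to 2 bits and inspect the error;
# small groups keep the per-element error bounded even at 2 bits.
k = torch.randn(8, 128)
q, s, z = quantize_asym(k, bits=2, group_size=64)
k_hat = dequantize_asym(q, s, z, k.shape)
print((k - k_hat).abs().max())
```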