hkproj / quantization-notes
Notes on quantization in neural networks
☆58 · Updated 11 months ago
Related projects
Alternatives and complementary repositories for quantization-notes
- LLaMA 2 implemented from scratch in PyTorch ☆254 · Updated last year
- A family of compressed models obtained via pruning and knowledge distillation ☆283 · Updated last week
- A repository dedicated to evaluating the performance of quantized LLaMA3 using various quantization methods. ☆166 · Updated 3 months ago
- Reference implementation of the Mistral AI 7B v0.1 model. ☆27 · Updated 10 months ago
- LoRA and DoRA from-scratch implementations ☆188 · Updated 8 months ago
- LoRA: Low-Rank Adaptation of Large Language Models, implemented in PyTorch ☆82 · Updated last year
- Advanced quantization algorithm for LLMs; official implementation of "Optimize Weight Rounding via Signed Gradient Descent for t…" ☆248 · Updated this week
- Code for the NeurIPS 2024 paper QuaRot: end-to-end 4-bit inference of large language models. ☆284 · Updated 3 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆253 · Updated last month
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache ☆241 · Updated last month
- This repository contains the experimental PyTorch native float8 training UX ☆211 · Updated 3 months ago
- Complete implementation of Llama2 with/without KV cache & inference 🚀 ☆47 · Updated 5 months ago
- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization ☆87 · Updated last month
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components. ☆165 · Updated this week
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024 ☆229 · Updated 3 weeks ago
- Prune transformer layers ☆64 · Updated 5 months ago
- Code for my "Optimizing Memory Usage for Training LLMs and Vision Transformers in PyTorch" blog po… ☆86 · Updated last year
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆685 · Updated this week
- Code repo for the paper "SpinQuant: LLM quantization with learned rotations" ☆158 · Updated last week
- PB-LLM: Partially Binarized Large Language Models ☆148 · Updated last year
- Google TPU optimizations for transformers models ☆75 · Updated this week
- Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models" ☆254 · Updated 2 months ago
- Simple and fast low-bit matmul kernels in CUDA / Triton ☆143 · Updated this week
- Slides, notes, and materials for the workshop ☆306 · Updated 5 months ago
- ☆133 · Updated 9 months ago
- Cataloging released Triton kernels. ☆134 · Updated 2 months ago
- Applied AI experiments and examples for PyTorch ☆166 · Updated 3 weeks ago
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆112 · Updated 8 months ago
- Easy and Efficient Quantization for Transformers ☆180 · Updated 4 months ago
- Distributed training (multi-node) of a Transformer model ☆43 · Updated 7 months ago
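Most of the repositories above are variations on one core operation: mapping floating-point weights to low-bit integers and back. As a point of reference (not tied to any particular project listed here), a minimal sketch of symmetric per-tensor int8 quantization looks like this; the function names and test values are illustrative only:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: one scale maps max |w| to 127."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original weights."""
    return q.astype(np.float32) * scale

w = np.array([0.1, -0.6, 0.25, 1.0], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# rounding error is bounded by half a quantization step (scale / 2)
assert np.max(np.abs(w - w_hat)) <= s / 2 + 1e-6
```

The schemes in the listed projects refine this basic idea along different axes: finer granularity (per-channel or per-group scales), asymmetric zero points (as in KIVI), learned rotations before quantization (QuaRot, SpinQuant), or quantization-aware training (LLM-QAT).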