hkproj / quantization-notes
Notes on quantization in neural networks
☆70 · Updated last year
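As a quick orientation to the topic these notes cover, a symmetric per-tensor int8 quantize/dequantize round-trip is the simplest starting point. The sketch below is not taken from the repository; the function names and the per-tensor symmetric scheme are illustrative assumptions.

```python
import numpy as np

def symmetric_quantize_int8(x: np.ndarray):
    """Quantize a float tensor to int8 with a single symmetric scale."""
    # Choose the scale so the largest magnitude maps to the int8 limit (127).
    # The small floor avoids division by zero for an all-zero tensor.
    scale = max(np.abs(x).max() / 127.0, 1e-8)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Map int8 codes back to approximate float values."""
    return q.astype(np.float32) * scale

# Round-trip a small weight matrix and inspect the quantization error.
w = np.random.randn(4, 4).astype(np.float32)
q, s = symmetric_quantize_int8(w)
print("max abs error:", np.abs(w - dequantize_int8(q, s)).max())
```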
Alternatives and similar repositories for quantization-notes:
Users interested in quantization-notes are comparing it to the repositories listed below.
- ☆128 · Updated last month
- LoRA and DoRA from Scratch Implementations ☆196 · Updated 11 months ago
- Complete implementation of Llama2 with/without KV cache & inference 🚀 ☆47 · Updated 8 months ago
- A repository dedicated to evaluating the performance of quantized LLaMA3 using various quantization methods. ☆178 · Updated last month
- Reference implementation of Mistral AI 7B v0.1 model. ☆28 · Updated last year
- LORA: Low-Rank Adaptation of Large Language Models implemented using PyTorch ☆94 · Updated last year
- ☆142 · Updated last year
- Prune transformer layers ☆67 · Updated 8 months ago
- Mixed precision training from scratch with Tensors and CUDA ☆21 · Updated 9 months ago
- ☆135 · Updated last week
- Official Implementation of SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks ☆33 · Updated 2 weeks ago
- A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code. ☆269 · Updated this week
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024 ☆272 · Updated last week
- PB-LLM: Partially Binarized Large Language Models ☆151 · Updated last year
- Code for the NeurIPS 2024 paper QuaRot: end-to-end 4-bit inference for large language models. ☆342 · Updated 2 months ago
- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization ☆102 · Updated 4 months ago
- Notes about LLaMA 2 model ☆53 · Updated last year
- Unofficial implementation of https://arxiv.org/pdf/2407.14679 ☆42 · Updated 5 months ago
- Fast low-bit matmul kernels in Triton ☆236 · Updated this week
- Code for studying the super weight in LLM ☆80 · Updated 2 months ago
- Notebooks for fine-tuning PaliGemma ☆93 · Updated last month
- Repo hosting code and materials related to speeding up LLM inference using token merging. ☆35 · Updated 9 months ago
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients. ☆191 · Updated 7 months ago
- [NeurIPS 24 Spotlight] MaskLLM: Learnable Semi-structured Sparsity for Large Language Models ☆150 · Updated last month
- Distributed training (multi-node) of a Transformer model ☆53 · Updated 10 months ago
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆246 · Updated 4 months ago
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆229 · Updated this week
- Notebook and scripts that showcase running quantized diffusion models on consumer GPUs ☆38 · Updated 3 months ago