insoochung / transformer_bcq
BCQ tutorial for transformers
☆17 · Updated last year
Alternatives and similar repositories for transformer_bcq
Users interested in transformer_bcq are comparing it to the repositories listed below.
- Intel Gaudi's Megatron DeepSpeed Large Language Models for training ☆13 · Updated 4 months ago
- [ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs ☆106 · Updated 3 weeks ago
- ☆125 · Updated last year
- Code for the AAAI 2024 Oral paper "OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Model… ☆61 · Updated last year
- ☆81 · Updated last year
- Code for studying the super weight in LLM ☆100 · Updated 5 months ago
- ☆25 · Updated 6 months ago
- Official Implementation of SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks ☆37 · Updated 3 months ago
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆117 · Updated last year
- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization ☆106 · Updated 6 months ago
- The triangle in action! Triton ☆16 · Updated last year
- ☆47 · Updated last year
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8. ☆45 · Updated 9 months ago
- [NeurIPS'23] Speculative Decoding with Big Little Decoder ☆91 · Updated last year
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry ☆41 · Updated last year
- ☆41 · Updated last year
- Code for "RSQ: Learning from Important Tokens Leads to Better Quantized LLMs" ☆15 · Updated 2 months ago
- Pytorch/XLA SPMD Test code in Google TPU ☆23 · Updated last year
- ☆29 · Updated last year
- ☆55 · Updated last year
- Repository for CPU Kernel Generation for LLM Inference ☆26 · Updated last year
- ☆69 · Updated last year
- The official implementation of the EMNLP 2023 paper LLM-FP4 ☆198 · Updated last year
- Experiment of using Tangent to autodiff triton ☆78 · Updated last year
- ☆26 · Updated last year
- ☆128 · Updated 2 months ago
- This repository contains the training code for ParetoQ, introduced in our work "ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization" ☆56 · Updated last month
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS … ☆59 · Updated 7 months ago
- This is a fork of SGLang for hip-attention integration. Please refer to hip-attention for details. ☆13 · Updated this week
- Awesome Triton Resources ☆27 · Updated 2 weeks ago