insoochung / transformer_bcq
BCQ tutorial for transformers
☆17 · Updated 2 years ago
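BCQ here is binary-coding quantization: a weight tensor w is approximated as a small weighted sum of {-1, +1} binary codes, w ≈ sum_i alpha_i * b_i, so k sign bits per weight plus k shared scales stand in for full-precision storage. As a rough orientation only, here is a minimal greedy sketch of the general technique in NumPy; it is not code from this repo, and the function names are illustrative:

```python
import numpy as np

def bcq_quantize(w, num_bits=3):
    """Greedy binary-coding quantization (illustrative sketch).

    Approximates w with sum_i alpha_i * b_i, where each b_i is a
    {-1, +1} code and alpha_i is its least-squares optimal scale.
    """
    residual = w.astype(np.float64)
    alphas, codes = [], []
    for _ in range(num_bits):
        b = np.where(residual >= 0, 1.0, -1.0)  # sign code for this bit
        alpha = np.abs(residual).mean()         # argmin_a ||residual - a*b||^2
        alphas.append(alpha)
        codes.append(b)
        residual = residual - alpha * b         # quantize the leftover error
    return np.array(alphas), np.stack(codes)

def bcq_dequantize(alphas, codes):
    # Reconstruct sum_i alpha_i * b_i from the stored scales and codes.
    return np.tensordot(alphas, codes, axes=1)

# Example: 3-bit BCQ of a random weight matrix.
w = np.random.randn(4, 8)
alphas, codes = bcq_quantize(w, num_bits=3)
print(np.abs(w - bcq_dequantize(alphas, codes)).mean())  # small residual error
```

Each extra bit quantizes the residual left by the previous ones, so reconstruction error shrinks as num_bits grows; practical BCQ schemes for transformers typically refine the codes and scales further, e.g. with alternating updates.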
Alternatives and similar repositories for transformer_bcq
Users interested in transformer_bcq are comparing it to the libraries listed below.
- ☆81 · Updated last year
- ☆127 · Updated last year
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry ☆42 · Updated last year
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆118 · Updated last year
- [NeurIPS'23] Speculative Decoding with Big Little Decoder ☆93 · Updated last year
- PB-LLM: Partially Binarized Large Language Models ☆153 · Updated last year
- Simple implementation of Speculative Sampling in NumPy for GPT-2 ☆95 · Updated last year
- ☆69 · Updated last year
- PyTorch/XLA SPMD test code on Google TPU ☆23 · Updated last year
- Intel Gaudi's Megatron-DeepSpeed for training large language models ☆13 · Updated 7 months ago
- Easy and Efficient Quantization for Transformers ☆198 · Updated last month
- Awesome Triton Resources ☆32 · Updated 3 months ago
- Experiment in using Tangent to autodiff Triton ☆80 · Updated last year
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS … ☆61 · Updated 9 months ago
- ☆26 · Updated last year
- Code for studying the super weight in LLMs ☆114 · Updated 8 months ago
- ☆114 · Updated last year
- Experiments on Multi-Head Latent Attention ☆94 · Updated 11 months ago
- Triton implementation of the HyperAttention algorithm ☆48 · Updated last year
- [ACL 2025] Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models ☆29 · Updated last week
- ☆68 · Updated last year
- ☆27 · Updated 8 months ago
- Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of… ☆140 · Updated last year
- Low-bit optimizers for PyTorch ☆130 · Updated last year
- Benchmarking different models with PyTorch 2.0 ☆20 · Updated 2 years ago
- ☆154 · Updated 2 years ago
- JORA: JAX Tensor-Parallel LoRA Library (ACL 2024) ☆35 · Updated last year
- Fast and memory-efficient exact attention ☆69 · Updated 5 months ago
- Compressed LLMs for Efficient Text Generation [ICLR'24 Workshop] ☆85 · Updated 10 months ago
- ☆137 · Updated 5 months ago