insoochung / transformer_bcq
BCQ tutorial for transformers
☆17 · Updated 2 years ago
Alternatives and similar repositories for transformer_bcq
Users interested in transformer_bcq are comparing it to the libraries listed below.
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆118 · Updated last year
- Easy and Efficient Quantization for Transformers ☆203 · Updated 2 months ago
- ☆27 · Updated last year
- Intel Gaudi's Megatron DeepSpeed Large Language Models for training ☆13 · Updated 9 months ago
- ☆83 · Updated last year
- Simple implementation of Speculative Sampling in NumPy for GPT-2. ☆96 · Updated 2 years ago
- ☆128 · Updated last year
- [NeurIPS'23] Speculative Decoding with Big Little Decoder ☆94 · Updated last year
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS … ☆60 · Updated 11 months ago
- ☆45 · Updated 10 months ago
- PB-LLM: Partially Binarized Large Language Models ☆154 · Updated last year
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆82 · Updated last year
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. ☆87 · Updated 2 months ago
- Code for studying the super weight in LLMs ☆119 · Updated 9 months ago
- ☆56 · Updated last year
- Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of… ☆146 · Updated last year
- Experiment of using Tangent to autodiff Triton ☆81 · Updated last year
- Compressed LLMs for Efficient Text Generation [ICLR'24 Workshop] ☆88 · Updated last year
- ☆119 · Updated last year
- The triangle in action! Triton (Korean-language Triton tutorial) ☆16 · Updated last year
- ☆29 · Updated 10 months ago
- ☆142 · Updated 7 months ago
- Layer-Condensed KV cache w/ 10 times larger batch size, fewer params and less computation. Dramatic speed up with better task performance… ☆155 · Updated 5 months ago
- The official implementation of the EMNLP 2023 paper LLM-FP4 ☆214 · Updated last year
- ☆112 · Updated last year
- PyTorch/XLA SPMD test code on Google TPUs ☆23 · Updated last year
- A hackable, simple, and research-friendly GRPO training framework with high-speed weight synchronization in a multinode environment. ☆31 · Updated 3 weeks ago
- Evaluation code repository for the paper "ModuLoRA: Finetuning 3-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers". (2023… ☆13 · Updated last year
- Official implementation for Training LLMs with MXFP4 ☆91 · Updated 4 months ago
- ☆74 · Updated 5 months ago