insoochung / transformer_bcq
BCQ tutorial for transformers
☆18 · Updated 2 years ago
Alternatives and similar repositories for transformer_bcq
Users interested in transformer_bcq are comparing it to the libraries listed below.
- ☆69 · Updated last year
- Easy and Efficient Quantization for Transformers ☆202 · Updated 4 months ago
- ☆83 · Updated last year
- Simple implementation of Speculative Sampling in NumPy for GPT-2. ☆98 · Updated 2 years ago
- [NeurIPS'23] Speculative Decoding with Big Little Decoder ☆95 · Updated last year
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆118 · Updated last year
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry ☆42 · Updated last year
- ☆57 · Updated last year
- Triangles in action! Triton ☆16 · Updated last year
- JORA: JAX Tensor-Parallel LoRA Library (ACL 2024) ☆36 · Updated last year
- A hackable, simple, and research-friendly GRPO training framework with high-speed weight synchronization in a multi-node environment. ☆31 · Updated 2 months ago
- ☆121 · Updated last year
- The evaluation framework for training-free sparse attention in LLMs ☆102 · Updated last month
- ☆127 · Updated last year
- Experiments on Multi-Head Latent Attention ☆98 · Updated last year
- Official implementation for Training LLMs with MXFP4 ☆102 · Updated 6 months ago
- ☆147 · Updated 9 months ago
- PB-LLM: Partially Binarized Large Language Models ☆156 · Updated last year
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. ☆92 · Updated 3 months ago
- Compressed LLMs for Efficient Text Generation [ICLR'24 Workshop] ☆88 · Updated last year
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS … ☆60 · Updated last year
- Code for studying the super weight in LLM ☆119 · Updated 11 months ago
- PyTorch/XLA SPMD test code on Google TPU ☆23 · Updated last year
- QuIP quantization ☆60 · Updated last year
- Low-bit optimizers for PyTorch ☆132 · Updated 2 years ago
- OSLO: Open Source for Large-scale Optimization ☆174 · Updated 2 years ago
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024) ☆162 · Updated 7 months ago
- Intel Gaudi's Megatron DeepSpeed Large Language Models for training ☆15 · Updated 10 months ago
- Repository for CPU Kernel Generation for LLM Inference ☆26 · Updated 2 years ago
- ☆27 · Updated last year