insoochung / transformer_bcq
BCQ tutorial for transformers
☆17 · Updated last year
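For readers new to the topic, here is a minimal sketch of the greedy binary-coded quantization (BCQ) scheme that such a tutorial typically covers: each weight row is approximated as a sum of scaled {-1, +1} codes. This is an illustrative assumption about the method, not code from the repository, and the function name is hypothetical.

```python
# Illustrative sketch (assumption): greedy multi-bit binary-coded quantization.
# Each weight vector w is approximated as sum_i alpha_i * b_i with b_i in {-1, +1}.
import numpy as np

def bcq_greedy(w: np.ndarray, num_bits: int = 3):
    """Greedily fit (scales, binary codes) that minimize the residual at each step."""
    residual = w.astype(np.float64).copy()
    alphas, codes = [], []
    for _ in range(num_bits):
        b = np.where(residual >= 0, 1.0, -1.0)   # sign of the remaining residual
        alpha = np.abs(residual).mean()          # least-squares scale for this code
        alphas.append(alpha)
        codes.append(b)
        residual -= alpha * b                    # peel off this bit's contribution
    return np.array(alphas), np.stack(codes)

# Example: quantize a random weight row to 3 "bits" and check reconstruction error.
w = np.random.randn(512)
alphas, codes = bcq_greedy(w, num_bits=3)
w_hat = (alphas[:, None] * codes).sum(axis=0)
print(f"relative L2 error: {np.linalg.norm(w - w_hat) / np.linalg.norm(w):.3f}")
```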
Alternatives and similar repositories for transformer_bcq
Users interested in transformer_bcq are comparing it to the libraries listed below.
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry ☆42 · Updated last year
- Code for studying the super weight in LLMs ☆107 · Updated 6 months ago
- ☆26 · Updated last year
- ☆126 · Updated last year
- ☆50 · Updated last year
- Intel Gaudi's Megatron-DeepSpeed for training large language models ☆13 · Updated 6 months ago
- Code for the AAAI 2024 Oral paper "OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Model… ☆63 · Updated last year
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆118 · Updated last year
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS … ☆59 · Updated 8 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆79 · Updated 9 months ago
- ☆109 · Updated last year
- PB-LLM: Partially Binarized Large Language Models ☆152 · Updated last year
- Awesome Triton Resources ☆31 · Updated last month
- [NeurIPS'23] Speculative Decoding with Big Little Decoder ☆92 · Updated last year
- Benchmarking different models with PyTorch 2.0 ☆21 · Updated 2 years ago
- ☆39 · Updated 7 months ago
- Code for the ACL 2023 paper: "Rethinking the Role of Scale for In-Context Learning: An Interpretability-based Case Study at 66 Billion Sc… ☆30 · Updated last year
- ☆81 · Updated last year
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆131 · Updated last week
- Repository for CPU Kernel Generation for LLM Inference ☆26 · Updated last year
- This repository contains the training code of ParetoQ, introduced in our work "ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization" ☆80 · Updated 3 weeks ago
- Triton Implementation of HyperAttention Algorithm ☆48 · Updated last year
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8. ☆45 · Updated 11 months ago
- ☆157 · Updated last year
- Simple implementation of Speculative Sampling in NumPy for GPT-2. ☆95 · Updated last year
- ☆130 · Updated 4 months ago
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. ☆70 · Updated last week
- The evaluation framework for training-free sparse attention in LLMs ☆69 · Updated last week
- ☆114 · Updated 3 weeks ago
- Experiment of using Tangent to autodiff Triton ☆79 · Updated last year