insoochung / transformer_bcq
BCQ tutorial for transformers
☆16 · Updated last year
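BCQ here most likely refers to binary-coding quantization, which approximates a weight matrix as a sum of {-1, +1} binary codes, each paired with a scale. The following is a minimal, hypothetical sketch of greedy row-wise BCQ in PyTorch; the function name `bcq_quantize`, the 3-bit default, and the per-row scaling are illustrative assumptions, not code taken from the tutorial.

```python
import torch

def bcq_quantize(weight: torch.Tensor, num_bits: int = 3):
    """Greedy row-wise binary-coding quantization (illustrative sketch only).

    Approximates each row of `weight` as sum_k alpha_k * b_k,
    where b_k is a {-1, +1} code and alpha_k is a per-row scale.
    """
    residual = weight.clone()
    alphas, codes = [], []
    for _ in range(num_bits):
        b = torch.sign(residual)
        b[b == 0] = 1.0                                   # sign(0) == 0; force a valid {-1, +1} code
        alpha = (residual * b).mean(dim=1, keepdim=True)  # least-squares scale for this code, per row
        alphas.append(alpha)
        codes.append(b)
        residual = residual - alpha * b                   # quantize the remaining error in the next pass
    w_hat = sum(a * b for a, b in zip(alphas, codes))     # dequantized approximation of `weight`
    return w_hat, alphas, codes

# Toy usage: quantize a random weight matrix and inspect the reconstruction error.
w = torch.randn(8, 16)
w_hat, _, _ = bcq_quantize(w, num_bits=3)
print((w - w_hat).abs().mean())
```

Each additional bit adds one more code/scale pair, trading memory for reconstruction accuracy.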
Alternatives and similar repositories for transformer_bcq:
Users interested in transformer_bcq are comparing it to the repositories listed below
- ☆25 · Updated last year
- ☆85 · Updated 8 months ago
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆114 · Updated 10 months ago
- The triangle in action! Triton ☆15 · Updated 11 months ago
- Repository for Sparse Finetuning of LLMs via modified version of the MosaicML llmfoundry ☆40 · Updated last year
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" ☆58 · Updated 3 months ago
- ☆125 · Updated last year
- ☆45 · Updated last year
- ☆74 · Updated last year
- ☆97 · Updated 5 months ago
- Pytorch/XLA SPMD Test code in Google TPU ☆23 · Updated 9 months ago
- ☆47 · Updated 5 months ago
- CUDA and Triton implementations of Flash Attention with SoftmaxN. ☆67 · Updated 8 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆64 · Updated 4 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆86 · Updated this week
- Transformers components but in Triton ☆31 · Updated 2 months ago
- FlexAttention w/ FlashAttention3 Support ☆27 · Updated 3 months ago
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8. ☆43 · Updated 6 months ago
- ☆37 · Updated 9 months ago
- Official Implementation of SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks ☆33 · Updated 7 months ago
- Code for studying the super weight in LLM ☆72 · Updated last month
- ☆41 · Updated last year
- Experiment of using Tangent to autodiff triton ☆74 · Updated last year
- Training-free Post-training Efficient Sub-quadratic Complexity Attention. Implemented with OpenAI Triton. ☆33 · Updated this week
- Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding (EMNLP 2023 Long) ☆56 · Updated 4 months ago
- ☆108 · Updated 4 months ago
- ☆46 · Updated last year
- [ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs ☆93 · Updated last month
- The official implementation of the EMNLP 2023 paper LLM-FP4 ☆176 · Updated last year
- A library for unit scaling in PyTorch ☆122 · Updated 2 months ago