insoochung / transformer_bcq
BCQ tutorial for transformers
☆16 · Updated last year
Related projects
Alternatives and complementary repositories for transformer_bcq
- Intel Gaudi's Megatron DeepSpeed Large Language Models for training ☆13 · Updated last month
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆112 · Updated 8 months ago
- ☆122 · Updated 10 months ago
- ☆77 · Updated 5 months ago
- ☆96 · Updated last month
- Repository for Sparse Finetuning of LLMs via modified version of the MosaicML llmfoundry ☆38 · Updated 10 months ago
- ☆22 · Updated 10 months ago
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" ☆56 · Updated last month
- [NeurIPS'23] Speculative Decoding with Big Little Decoder ☆86 · Updated 9 months ago
- Code for Palu: Compressing KV-Cache with Low-Rank Projection ☆57 · Updated this week
- A block-oriented training approach for inference-time optimization. ☆30 · Updated 3 months ago
- Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding (EMNLP 2023 Long) ☆53 · Updated last month
- Simple implementation of Speculative Sampling in NumPy for GPT-2. ☆89 · Updated last year
- CUDA and Triton implementations of Flash Attention with SoftmaxN. ☆66 · Updated 5 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆51 · Updated 2 months ago
- Easy and Efficient Quantization for Transformers ☆180 · Updated 4 months ago
- ☆63 · Updated last month
- NAACL '24 (Best Demo Paper Runner-Up) / MLSys @ NeurIPS '23 - RedCoast: A Lightweight Tool to Automate Distributed Training and Inference ☆61 · Updated last month
- Pytorch/XLA SPMD Test code in Google TPU ☆21 · Updated 7 months ago
- Large scale 4D parallelism pre-training for 🤗 transformers in Mixture of Experts *(still work in progress)* ☆80 · Updated 11 months ago
- Triton Implementation of HyperAttention Algorithm ☆46 · Updated 11 months ago
- ☆74 · Updated 11 months ago
- [ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs ☆83 · Updated 3 months ago
- Pytorch implementation for "Compressed Context Memory For Online Language Model Interaction" (ICLR'24) ☆50 · Updated 7 months ago
- LLM KV cache compression made easy ☆168 · Updated this week
- Triangles in practice! Triton ☆15 · Updated 9 months ago
- Simple implementation of muP, based on Spectral Condition for Feature Learning. The implementation is SGD-only; don't use it for Adam. ☆68 · Updated 3 months ago
- ☆44 · Updated 11 months ago
- ☆55 · Updated 6 months ago
- Experiment of using Tangent to autodiff triton ☆72 · Updated 10 months ago