insoochung / transformer_bcq
BCQ tutorial for transformers
☆17 · Updated last year
Alternatives and similar repositories for transformer_bcq:
Users interested in transformer_bcq are comparing it to the libraries listed below.
- Intel Gaudi's Megatron DeepSpeed Large Language Models for training ☆13 · Updated 2 months ago
- ☆35 · Updated 3 months ago
- Fast Hadamard transform in CUDA, with a PyTorch interface ☆151 · Updated 9 months ago
- ☆46 · Updated last year
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆116 · Updated last year
- ☆22 · Updated last year
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8. ☆43 · Updated 7 months ago
- [NeurIPS'23] Speculative Decoding with Big Little Decoder ☆89 · Updated last year
- Simple implementation of Speculative Sampling in NumPy for GPT-2. ☆92 · Updated last year
- ☆94 · Updated 9 months ago
- ☆25 · Updated last year
- ☆125 · Updated last year
- [ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs ☆97 · Updated 2 months ago
- The official implementation of the EMNLP 2023 paper LLM-FP4 ☆187 · Updated last year
- ☆47 · Updated last year
- Code for studying the super weight in LLM ☆91 · Updated 3 months ago
- Easy and Efficient Quantization for Transformers ☆192 · Updated last month
- Memory Optimizations for Deep Learning (ICML 2023) ☆62 · Updated 11 months ago
- Code for the AAAI 2024 Oral paper "OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Model… ☆58 · Updated last year
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆67 · Updated 6 months ago
- ☆100 · Updated 6 months ago
- ☆115 · Updated 3 weeks ago
- ☆23 · Updated 4 months ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆68 · Updated 4 months ago
- ☆80 · Updated last year
- Awesome Triton Resources ☆20 · Updated 3 months ago
- Explore training for quantized models ☆16 · Updated 2 months ago
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity ☆201 · Updated last year
- Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding (EMNLP 2023 Long) ☆56 · Updated 5 months ago