huggingface / optimum-amd
AMD related optimizations for transformer models
☆57 · Updated 2 weeks ago
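For orientation, a minimal sketch of running a 🤗 Transformers checkpoint through ONNX Runtime on an AMD GPU via Optimum. It assumes a ROCm build of onnxruntime is installed; the model id is illustrative, and this is one possible entry point rather than the library's only API:

```python
# Minimal sketch: a Transformers checkpoint served through ONNX Runtime
# on an AMD GPU via Optimum. Assumes a ROCm build of onnxruntime;
# the model id below is illustrative only.
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# export=True converts the PyTorch checkpoint to ONNX on the fly;
# ROCMExecutionProvider targets AMD GPUs.
model = ORTModelForSequenceClassification.from_pretrained(
    model_id, export=True, provider="ROCMExecutionProvider"
)

inputs = tokenizer("Optimum on AMD hardware", return_tensors="pt")
logits = model(**inputs).logits
```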
Related projects
Alternatives and complementary repositories for optimum-amd
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs (see the usage sketch after this list) ☆89 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆253 · Updated last month
- Easy and lightning-fast training of 🤗 Transformers on Habana Gaudi processors (HPU) ☆153 · Updated this week
- Advanced quantization algorithm for LLMs; the official implementation of "Optimize Weight Rounding via Signed Gradient Descent for t…" ☆248 · Updated this week
- Google TPU optimizations for transformers models ☆75 · Updated this week
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆187 · Updated this week
- Production-ready LLM model compression/quantization toolkit with accelerated inference support for both CPU/GPU via HF, vLLM, and SGLang. ☆125 · Updated this week
- A general 2–8 bit quantization toolbox with GPTQ/AWQ/HQQ, and easy export to ONNX/ONNX Runtime ☆150 · Updated last month
- 🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O… ☆257 · Updated this week
- Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆79 · Updated this week
- Applied AI experiments and examples for PyTorch ☆166 · Updated 3 weeks ago
- An innovative library for efficient LLM inference via low-bit quantization ☆348 · Updated 2 months ago
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆226 · Updated last month
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆51 · Updated 2 months ago
- KV cache compression for high-throughput LLM inference ☆87 · Updated this week
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆112 · Updated 8 months ago
- Materials for learning SGLang ☆105 · Updated this week
- Code repo for the paper "SpinQuant: LLM quantization with learned rotations" ☆164 · Updated last week
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆50 · Updated this week
- Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5). ☆209 · Updated 3 weeks ago
- Simple and fast low-bit matmul kernels in CUDA / Triton ☆145 · Updated this week
- Easy and Efficient Quantization for Transformers ☆180 · Updated 4 months ago
- vLLM performance dashboard ☆18 · Updated 6 months ago
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components. ☆165 · Updated this week
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients. ☆173 · Updated 4 months ago
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models". ☆262 · Updated last year
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ☆193 · Updated this week
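As referenced in the vLLM entry above, a minimal offline-generation sketch using vLLM's Python API. The model id is illustrative, and running on AMD GPUs assumes a ROCm build of vLLM:

```python
# Minimal vLLM sketch: offline batched generation (no server needed).
# The model id is illustrative; AMD GPUs require a ROCm build of vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, max_tokens=64)

# generate() takes a list of prompts and returns one RequestOutput per prompt.
outputs = llm.generate(["Paged attention lets the KV cache"], params)
print(outputs[0].outputs[0].text)
```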