huggingface / optimum-amd
AMD-related optimizations for transformer models
☆57 · Updated this week
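
Below is a minimal sketch (not taken from the optimum-amd docs) of the baseline the library builds on: running a Hugging Face transformers model on an AMD GPU with a ROCm build of PyTorch, where AMD GPUs are exposed through the `cuda` device namespace. The model id is an arbitrary illustration.

```python
# Minimal sketch: plain transformers inference on an AMD GPU via ROCm PyTorch.
# On ROCm builds of torch, "cuda" maps to AMD GPUs; optimum-amd layers
# AMD-specific optimizations on top of this baseline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-125m"  # illustrative choice, any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model.to("cuda")  # the AMD GPU under ROCm

inputs = tokenizer("AMD GPUs can serve transformers via", return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```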
Related projects
Alternatives and complementary repositories for optimum-amd
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs (see the vLLM sketch after this list) ☆88 · Updated this week
- Easy and lightning-fast training of 🤗 Transformers on the Habana Gaudi processor (HPU) ☆152 · Updated this week
- A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ, with easy export to ONNX / ONNX Runtime ☆148 · Updated last month
- Materials for learning SGLang ☆75 · Updated this week
- Google TPU optimizations for transformer models ☆74 · Updated this week
- Simple and fast low-bit matmul kernels in CUDA / Triton ☆133 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆250 · Updated 3 weeks ago
- Applied AI experiments and examples for PyTorch ☆159 · Updated last week
- Advanced Quantization Algorithm for LLMs. This is the official implementation of "Optimize Weight Rounding via Signed Gradient Descent for t… ☆245 · Updated this week
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆183 · Updated last month
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆46 · Updated this week
- 🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O… ☆251 · Updated this week
- Fast and memory-efficient exact attention ☆138 · Updated this week
- KV cache compression for high-throughput LLM inference ☆82 · Updated this week
- Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5) ☆196 · Updated last week
- Machine Learning Agility (MLAgility) benchmark and benchmarking tools ☆38 · Updated last week
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆51 · Updated 2 months ago
- Breaking the Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆70 · Updated this week
- An innovative library for efficient LLM inference via low-bit quantization ☆348 · Updated 2 months ago
- Fast Inference of MoE Models with CPU-GPU Orchestration ☆170 · Updated last week
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆222 · Updated last month
- Production-ready LLM model compression/quantization toolkit with accelerated inference support for both CPU/GPU via HF, vLLM, and SGLang (see the quantization sketch after this list) ☆118 · Updated this week
- Large Language Model Text Generation Inference on Habana Gaudi ☆26 · Updated this week
- Python package for rocm-smi-lib ☆18 · Updated last month
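
For the vLLM entries above, here is a minimal offline-inference sketch using vLLM's public `LLM`/`SamplingParams` API. The model id is an illustrative assumption; vLLM also publishes ROCm builds for AMD GPUs.

```python
# Minimal sketch of vLLM offline inference (not specific to optimum-amd).
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # illustrative model id
sampling = SamplingParams(temperature=0.8, max_tokens=32)

# generate() takes a list of prompts and returns one RequestOutput per prompt.
outputs = llm.generate(["The AMD Instinct MI300X is"], sampling)
for out in outputs:
    print(out.outputs[0].text)
```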
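For the quantization toolkits listed above, a minimal sketch of one common path they interoperate with: 4-bit GPTQ quantization through transformers' `GPTQConfig`. The model id and calibration dataset are illustrative assumptions, and a GPTQ backend (e.g. GPTQModel or AutoGPTQ, plus optimum) must be installed.

```python
# Minimal sketch: 4-bit GPTQ quantization via transformers' GPTQConfig.
# Assumes a GPTQ backend and accelerate are installed; model/dataset are
# illustrative choices, not prescribed by any of the repos above.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Weights are calibrated and quantized at load time.
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=config, device_map="auto"
)
model.save_pretrained("opt-125m-gptq-4bit")
```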