huggingface / optimum-amd
AMD-related optimizations for transformer models
☆67 · Updated 3 months ago
Alternatives and similar repositories for optimum-amd:
Users interested in optimum-amd are comparing it to the libraries listed below:
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs (a usage sketch follows this list) ☆88 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆259 · Updated 4 months ago
- Easy and lightning-fast training of 🤗 Transformers on Habana Gaudi processors (HPU) ☆171 · Updated this week
- A general 2–8 bit quantization toolbox with GPTQ/AWQ/HQQ/VPTQ support and easy export to ONNX/ONNX Runtime ☆159 · Updated last week
- Boosting 4-bit inference kernels with 2:4 sparsity ☆64 · Updated 5 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆75 · Updated this week
- Machine Learning Agility (MLAgility) benchmark and benchmarking tools ☆38 · Updated 2 months ago
- 🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O… ☆286 · Updated 2 weeks ago
- Fast and memory-efficient exact attention ☆157 · Updated this week
- Google TPU optimizations for transformers models ☆98 · Updated 3 weeks ago
- Python package for rocm-smi-lib ☆20 · Updated 4 months ago
- Benchmark suite for LLMs from Fireworks.ai ☆66 · Updated last week
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆229 · Updated this week
- Repository for CPU kernel generation for LLM inference ☆25 · Updated last year
- Data preparation code for the Amber 7B LLM ☆85 · Updated 9 months ago
- Repository for sparse fine-tuning of LLMs via a modified version of MosaicML's llm-foundry ☆40 · Updated last year
- Fast low-bit matmul kernels in Triton ☆236 · Updated this week
- QuIP quantization ☆49 · Updated 11 months ago
- An innovative library for efficient LLM inference via low-bit quantization ☆351 · Updated 5 months ago
- The no-code AI toolchain ☆89 · Updated last week
- Development repository for the Triton language and compiler ☆107 · Updated this week
- vLLM performance dashboard ☆20 · Updated 9 months ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5) ☆234 · Updated 3 months ago
- [ICLR 2025] Breaking the Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆107 · Updated 2 months ago
- Repository with code and materials for speeding up LLM inference using token merging ☆35 · Updated 9 months ago
- Example of applying CUDA graphs to LLaMA-v2 (see the capture-and-replay sketch after this list) ☆11 · Updated last year
- Train, tune, and run inference with the Bamba model ☆84 · Updated last month
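
Several entries above are LLM inference engines; the first, vLLM, exposes a small public Python API for offline generation. Below is a minimal sketch of that API; the model name `facebook/opt-125m` and the sampling settings are illustrative choices, not anything prescribed by the listed repositories.

```python
# Minimal vLLM offline-inference sketch (assumes vLLM is installed and the
# illustrative model "facebook/opt-125m" can be downloaded).
from vllm import LLM, SamplingParams

prompts = ["What does optimum-amd optimize?"]
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    # Each RequestOutput pairs the original prompt with its completions.
    print(output.prompt, output.outputs[0].text)
```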
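
The CUDA-graphs entry applies a general PyTorch technique to LLaMA-v2. The sketch below shows only the generic capture-and-replay pattern on a tiny stand-in `Linear` model (an assumption for brevity; the listed repository captures a LLaMA-v2 forward pass) and requires a CUDA-capable GPU.

```python
import torch

model = torch.nn.Linear(64, 64).cuda().eval()  # stand-in for a real LLM forward
static_input = torch.randn(8, 64, device="cuda")

# Warm up on a side stream so capture sees fully initialized kernels.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass; later calls replay the recorded kernel sequence.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_output = model(static_input)

# Replays reuse the captured buffers: copy new data in, then replay.
static_input.copy_(torch.randn(8, 64, device="cuda"))
g.replay()
print(static_output.sum().item())
```

Capture removes per-kernel launch overhead on replay, which is why the technique pays off for small-batch LLM decoding.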