huggingface / optimum-benchmark
🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers, with full support for Optimum's hardware optimizations & quantization schemes.
☆286 · Updated 2 weeks ago
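optimum-benchmark can be driven from its CLI (Hydra-based configs) or from Python. Below is a minimal sketch of the Python entry point, assuming the class names shown in the project's README (`Benchmark`, `BenchmarkConfig`, `InferenceConfig`, `ProcessConfig`, `PyTorchConfig`); treat the exact fields as illustrative rather than authoritative, since they may differ between releases.

```python
# Minimal sketch of the optimum-benchmark Python API, based on its README;
# exact class names and fields are assumptions and may vary by version.
from optimum_benchmark import (
    Benchmark,
    BenchmarkConfig,
    InferenceConfig,
    ProcessConfig,
    PyTorchConfig,
)

if __name__ == "__main__":
    # Run the benchmark in an isolated subprocess (the "process" launcher).
    launcher_config = ProcessConfig()
    # Track latency and memory for the inference scenario.
    scenario_config = InferenceConfig(latency=True, memory=True)
    # Benchmark gpt2 on the PyTorch backend; no_weights=True uses randomly
    # initialized weights so the checkpoint download is skipped.
    backend_config = PyTorchConfig(model="gpt2", device="cpu", no_weights=True)

    benchmark_config = BenchmarkConfig(
        name="pytorch_gpt2",
        launcher=launcher_config,
        scenario=scenario_config,
        backend=backend_config,
    )
    benchmark_report = Benchmark.launch(benchmark_config)
    print(benchmark_report)  # aggregated latency/memory measurements
```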
Alternatives and similar repositories for optimum-benchmark:
Users interested in optimum-benchmark often compare it to the libraries listed below.
- FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups at medium batch sizes of 16-32 tokens. ☆723 · Updated 5 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆259 · Updated 4 months ago
- Easy and Efficient Quantization for Transformers ☆193 · Updated last week
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving ☆496 · Updated this week
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ☆332 · Updated 6 months ago
- Easy and lightning fast training of 🤗 Transformers on Habana Gaudi processor (HPU) ☆171 · Updated this week
- Code for the paper "QuIP: 2-Bit Quantization of Large Language Models With Guarantees" ☆360 · Updated 11 months ago
- ☆172 · Updated 4 months ago
- For releasing code related to compression methods for transformers, accompanying our publications ☆406 · Updated last month
- GPTQ inference Triton kernel ☆295 · Updated last year
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving ☆294 · Updated 7 months ago
- LLM KV cache compression made easy ☆394 · Updated this week
- [ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization ☆676 · Updated 6 months ago
- A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, with easy export to ONNX/ONNX Runtime ☆159 · Updated last week
- An innovative library for efficient LLM inference via low-bit quantization ☆351 · Updated 5 months ago
- Advanced Quantization Algorithm for LLMs/VLMs ☆371 · Updated this week
- Applied AI experiments and examples for PyTorch ☆224 · Updated this week
- Official PyTorch implementation of QA-LoRA ☆125 · Updated 11 months ago
- Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models" ☆269 · Updated 5 months ago
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆977 · Updated this week
- ☆224 · Updated this week
- Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs ☆190 · Updated this week
- Manage scalable open LLM inference endpoints in Slurm clusters ☆252 · Updated 7 months ago
- A throughput-oriented high-performance serving framework for LLMs ☆737 · Updated 4 months ago
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models" ☆265 · Updated last year
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆229 · Updated this week
- Code for the NeurIPS 2024 paper on QuaRot, end-to-end 4-bit inference for large language models ☆339 · Updated 2 months ago
- This repository contains the experimental PyTorch native float8 training UX ☆221 · Updated 6 months ago
- Comparison of Language Model Inference Engines ☆204 · Updated 2 months ago
- ☆52 · Updated 5 months ago