huggingface / optimum-benchmark
🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of Optimum's hardware optimizations & quantization schemes.
⭐ 257 · Updated this week
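optimum-benchmark runs can be defined either as Hydra configs driven from the CLI or directly in Python. Below is a minimal sketch of the Python API as shown in the project's README; the exact class and field names (`BenchmarkConfig`, `ProcessConfig`, `InferenceConfig`, `PyTorchConfig`) may shift between releases, so treat it as illustrative rather than definitive.

```python
# Minimal sketch of a benchmark run via optimum-benchmark's Python API.
# Class and field names follow the README at the time of writing and may
# differ between releases.
from optimum_benchmark import (
    Benchmark,
    BenchmarkConfig,
    InferenceConfig,
    ProcessConfig,
    PyTorchConfig,
)

config = BenchmarkConfig(
    name="pytorch_gpt2",
    launcher=ProcessConfig(),  # run the benchmark in an isolated subprocess
    scenario=InferenceConfig(latency=True, memory=True),  # what to measure
    backend=PyTorchConfig(
        model="gpt2",
        device="cpu",
        no_weights=True,  # benchmark randomly initialized weights, skip the download
    ),
)

report = Benchmark.launch(config)
report.log()  # print the aggregated latency/memory measurements
```

The same run can be expressed as a Hydra config and launched from the CLI, e.g. `optimum-benchmark --config-dir examples/ --config-name pytorch_bert`.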
Related projects
Alternatives and complementary repositories for optimum-benchmark
- Easy and lightning fast training of 🤗 Transformers on Habana Gaudi processor (HPU) ⭐ 153 · Updated this week
- Easy and Efficient Quantization for Transformers ⭐ 180 · Updated 4 months ago
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ⭐ 624 · Updated 2 months ago
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ⭐ 305 · Updated 3 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ⭐ 253 · Updated last month
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ⭐ 685 · Updated this week
- GPTQ inference Triton kernel ⭐ 284 · Updated last year
- A throughput-oriented high-performance serving framework for LLMs ⭐ 636 · Updated 2 months ago
- Advanced Quantization Algorithm for LLMs. This is the official implementation of "Optimize Weight Rounding via Signed Gradient Descent for t… ⭐ 248 · Updated this week
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving ⭐ 443 · Updated last week
- For releasing code related to compression methods for transformers, accompanying our publications ⭐ 372 · Updated last month
- Applied AI experiments and examples for PyTorch ⭐ 166 · Updated 3 weeks ago
- A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ, and easy export to ONNX/ONNX Runtime. ⭐ 149 · Updated last month
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving ⭐ 278 · Updated 4 months ago
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components. ⭐ 165 · Updated this week
- An innovative library for efficient LLM inference via low-bit quantization ⭐ 348 · Updated 2 months ago
- Latency and Memory Analysis of Transformer Models for Training and Inference ⭐ 355 · Updated last week
- OpenAI compatible API for TensorRT LLM triton backend ⭐ 177 · Updated 3 months ago
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs (see the usage sketch after this list) ⭐ 89 · Updated this week
- LLMPerf is a library for validating and benchmarking LLMs ⭐ 645 · Updated 3 months ago
- A low-latency & high-throughput serving engine for LLMs ⭐ 245 · Updated 2 months ago
- [ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization ⭐ 649 · Updated 3 months ago
- Ultra-Fast and Cheaper Long-Context LLM Inference ⭐ 233 · Updated this week
- Comparison of Language Model Inference Engines ⭐ 190 · Updated 2 months ago
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ⭐ 193 · Updated this week
- A family of compressed models obtained via pruning and knowledge distillation ⭐ 283 · Updated last week
- Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees" ⭐ 350 · Updated 8 months ago
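Since vLLM appears twice in the list above, here is a minimal offline-generation sketch against its public Python API; the model id and sampling settings are illustrative placeholders, not recommendations.

```python
# Minimal sketch of offline batched generation with vLLM.
# Model id and sampling parameters are illustrative placeholders.
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "KV cache quantization reduces memory because",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

llm = LLM(model="facebook/opt-125m")  # any vLLM-supported Hugging Face model id
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

This is the offline batch path; the same engine can also expose an OpenAI-compatible HTTP server, which is what several of the serving-oriented projects above benchmark against.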