huggingface / optimum
🚀 Accelerate inference and training of 🤗 Transformers, Diffusers, TIMM and Sentence Transformers with easy-to-use hardware optimization tools
☆3,250 · Updated last month
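As a rough illustration of what these hardware optimization tools look like in practice, below is a minimal sketch of one common optimum workflow: exporting a Transformers checkpoint to ONNX and running it with ONNX Runtime. The checkpoint name is an illustrative assumption, and the exact API can vary between optimum versions.

```python
# Sketch: export a Transformers model to ONNX and run it via ONNX Runtime
# through optimum. The checkpoint below is only an example choice.
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # assumed checkpoint
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The ONNX Runtime model plugs into the usual transformers pipeline API.
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("Hardware-optimized inference is easy to set up."))
```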
Alternatives and similar repositories for optimum
Users interested in optimum are comparing it to the libraries listed below.
- MII makes low-latency and high-throughput inference possible, powered by DeepSpeed. ☆2,088 · Updated 6 months ago
- Accessible large language models via k-bit quantization for PyTorch (see the 4-bit loading sketch after this list). ☆7,896 · Updated this week
- [MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration ☆3,414 · Updated 6 months ago
- PyTorch extensions for high performance and large scale training. ☆3,393 · Updated 8 months ago
- A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on H… ☆3,081 · Updated last week
- Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀 ☆1,689 · Updated last year
- 🤗 Evaluate: A library for easily evaluating machine learning models and datasets. ☆2,401 · Updated 2 months ago
- 🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (i… ☆9,443 · Updated this week
- PyTorch native quantization and sparsity for training and inference ☆2,617 · Updated last week
- AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation: ☆2,301 · Updated 8 months ago
- SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model compression techniques on PyTorch, TensorFlow, … ☆2,570 · Updated this week
- ⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Pl… ☆2,173 · Updated last year
- Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers". ☆2,245 · Updated last year
- Transformer related optimization, including BERT, GPT ☆6,382 · Updated last year
- Fast inference engine for Transformer models ☆4,252 · Updated this week
- Simple, safe way to store and distribute tensors ☆3,589 · Updated 3 weeks ago
- A PyTorch quantization backend for optimum ☆1,021 · Updated last month
- Olive: Simplify ML Model Finetuning, Conversion, Quantization, and Optimization for CPUs, GPUs and NPUs. ☆2,232 · Updated this week
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters ☆1,893 · Updated last year
- Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackab… ☆1,586 · Updated last year
- An easy-to-use LLMs quantization package with user-friendly APIs, based on the GPTQ algorithm. ☆5,019 · Updated 9 months ago
- Sparsity-aware deep learning inference runtime for CPUs ☆3,158 · Updated 7 months ago
- Minimalistic large language model 3D-parallelism training ☆2,411 · Updated last month
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆2,580 · Updated this week
- PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments. ☆833 · Updated 5 months ago
- Python bindings for the Transformer models implemented in C/C++ using the GGML library. ☆1,875 · Updated last year
- Large Language Model Text Generation Inference ☆10,728 · Updated last week
- The Triton TensorRT-LLM Backend ☆914 · Updated this week
- Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads ☆2,689 · Updated last year
- PyTorch native post-training library ☆5,642 · Updated last week
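For context on how several of the quantization libraries above are typically used, here is a minimal sketch of loading a model in 4-bit with bitsandbytes through its 🤗 Transformers integration. The checkpoint, compute dtype, and generation settings are illustrative assumptions, not recommendations from any of the listed projects.

```python
# Sketch: 4-bit weight quantization at load time via bitsandbytes + transformers.
# Requires the bitsandbytes package and a CUDA GPU; the model ID is an example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize linear layers to 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumed compute dtype for matmuls
)

model_id = "facebook/opt-1.3b"  # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quant_config)

inputs = tokenizer("Quantization reduces memory by", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```

The same load-time pattern (a quantization config passed to `from_pretrained`) is roughly how the GPTQ- and AWQ-based libraries in the list are consumed as well, though each has its own calibration and export steps.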