huggingface / optimum
🚀 Accelerate inference and training of 🤗 Transformers, Diffusers, TIMM and Sentence Transformers with easy-to-use hardware optimization tools
☆2,950 · Updated this week
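For orientation, here is a minimal sketch of what optimum's "hardware optimization tools" look like in practice: exporting a Transformers checkpoint to ONNX Runtime. It assumes the `optimum[onnxruntime]` extra is installed; the checkpoint name and input sentence are illustrative, not taken from this listing.

```python
# Minimal sketch: run a Transformers checkpoint through ONNX Runtime via optimum.
# Assumes `pip install optimum[onnxruntime]`; the model id below is illustrative.
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# export=True converts the PyTorch weights to ONNX on the fly and serves them with ONNX Runtime.
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)

inputs = tokenizer("Optimum keeps the familiar Transformers API on top of hardware backends.", return_tensors="pt")
logits = model(**inputs).logits
print(logits.argmax(dim=-1))
```

The `from_pretrained` pattern mirrors 🤗 Transformers, which is why several of the libraries below are compared against optimum as drop-in acceleration backends.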
Alternatives and similar repositories for optimum
Users interested in optimum are comparing it to the libraries listed below.
- Accessible large language models via k-bit quantization for PyTorch (a 4-bit loading sketch follows this list). ☆7,150 · Updated this week
- MII makes low-latency and high-throughput inference possible, powered by DeepSpeed. ☆2,020 · Updated 2 months ago
- AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation: ☆2,193 · Updated last month
- PyTorch extensions for high performance and large scale training. ☆3,331 · Updated last month
- [MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration ☆3,081 · Updated last week
- 🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (i… ☆8,860 · Updated this week
- ⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Pl… ☆2,169 · Updated 8 months ago
- 🤗 Evaluate: A library for easily evaluating machine learning models and datasets. ☆2,241 · Updated this week
- Simple, safe way to store and distribute tensors ☆3,311 · Updated last week
- Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀 ☆1,688 · Updated 8 months ago
- SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX R… ☆2,434 · Updated this week
- Sparsity-aware deep learning inference runtime for CPUs ☆3,152 · Updated 3 weeks ago
- Transformer-related optimization, including BERT and GPT ☆6,211 · Updated last year
- Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers". ☆2,131 · Updated last year
- A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Bla… ☆2,507 · Updated this week
- A PyTorch quantization backend for optimum ☆955 · Updated this week
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters ☆1,836 · Updated last year
- An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm. ☆4,873 · Updated 2 months ago
- Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackab… ☆1,572 · Updated last year
- Large Language Model Text Generation Inference ☆10,249 · Updated this week
- Minimalistic large language model 3D-parallelism training ☆1,942 · Updated this week
- Hackable and optimized Transformers building blocks, supporting a composable construction. ☆9,610 · Updated this week
- Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads ☆2,549 · Updated 11 months ago
- Fast inference engine for Transformer models ☆3,867 · Updated 2 months ago
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆1,518 · Updated this week
- PyTorch native quantization and sparsity for training and inference ☆2,125 · Updated this week
- A framework for few-shot evaluation of language models. ☆9,326 · Updated this week
- Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs ☆3,024 · Updated last month
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models ☆1,426 · Updated 11 months ago
- Python bindings for the Transformer models implemented in C/C++ using GGML library. ☆1,867 · Updated last year
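As referenced in the first entry of the list, a minimal sketch of k-bit quantized loading through the Transformers integration of that library. It assumes a CUDA GPU and that transformers, accelerate and bitsandbytes are installed; the checkpoint id and prompt are illustrative.

```python
# Minimal sketch: load a causal LM in 4-bit via the transformers + bitsandbytes integration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 4-bit quantization handled by bitsandbytes under the hood.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "facebook/opt-1.3b"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place quantized weights on the available GPU(s)
)

inputs = tokenizer("Quantization trades a little accuracy for a lot of memory:", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```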