huggingface / optimum
🚀 Accelerate inference and training of 🤗 Transformers, Diffusers, TIMM and Sentence Transformers with easy-to-use hardware optimization tools
☆2,977 · Updated this week
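
For orientation, here is a minimal sketch of optimum's typical workflow, assuming the ONNX Runtime backend (`optimum[onnxruntime]` installed); the checkpoint id is an illustrative placeholder:

```python
# Minimal sketch: export a 🤗 Transformers checkpoint to ONNX and run it
# through ONNX Runtime via optimum. The model id below is only an example.
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint

# export=True converts the PyTorch weights to ONNX on the fly
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Optimum wraps hardware-specific runtimes.", return_tensors="pt")
print(model(**inputs).logits)
```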
Alternatives and similar repositories for optimum
Users interested in optimum are comparing it to the libraries listed below.
- MII makes low-latency and high-throughput inference possible, powered by DeepSpeed. ☆2,031 · Updated 2 weeks ago
- Accessible large language models via k-bit quantization for PyTorch; see the 4-bit loading sketch after this list ☆7,212 · Updated last week
- Simple, safe way to store and distribute tensors; see the safetensors round-trip sketch after this list ☆3,345 · Updated last week
- [MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration ☆3,140 · Updated this week
- A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Blackwell GPUs ☆2,548 · Updated this week
- AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. ☆2,206 · Updated 2 months ago
- SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime ☆2,449 · Updated last week
- Transformer-related optimization, including BERT, GPT ☆6,238 · Updated last year
- Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀 ☆1,688 · Updated 8 months ago
- 🤗 Evaluate: A library for easily evaluating machine learning models and datasets. ☆2,259 · Updated this week
- Fast inference engine for Transformer models ☆3,902 · Updated 3 months ago
- 🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8) ☆8,914 · Updated last week
- PyTorch native quantization and sparsity for training and inference ☆2,168 · Updated this week
- Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers". ☆2,139 · Updated last year
- ⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms ☆2,167 · Updated 9 months ago
- A PyTorch quantization backend for optimum ☆962 · Updated last week
- Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackable ☆1,575 · Updated last year
- PyTorch extensions for high performance and large scale training. ☆3,337 · Updated 2 months ago
- An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm. ☆4,891 · Updated 3 months ago
- Sparsity-aware deep learning inference runtime for CPUs ☆3,157 · Updated last month
- Python bindings for Transformer models implemented in C/C++ using the GGML library. ☆1,868 · Updated last year
- ☆2,839 · Updated last month
- Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs ☆3,274 · Updated last month
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters ☆1,843 · Updated last year
- Large Language Model Text Generation Inference ☆10,311 · Updated last week
- Minimalistic large language model 3D-parallelism training ☆2,012 · Updated this week
- AITemplate is a Python framework which renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference. ☆4,655 · Updated 3 months ago
- PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments. ☆806 · Updated 2 weeks ago
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆1,616 · Updated this week
- Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks. ☆2,473 · Updated this week
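
The k-bit quantization entry above (bitsandbytes) is most often used through its 🤗 Transformers integration. A minimal sketch, assuming a CUDA GPU with bitsandbytes installed; the checkpoint id is a placeholder:

```python
# Hedged sketch: load a causal LM with 4-bit (NF4) weights via bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit NF4
    bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in bf16
)

# device_map="auto" places the quantized layers on the available GPU(s)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # illustrative checkpoint; any causal LM works
    quantization_config=quant_config,
    device_map="auto",
)
```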
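Likewise, the safetensors entry is a small API: tensors are written to a flat, memory-mappable file and loaded back without pickle execution. A minimal round-trip sketch:

```python
# Hedged sketch: save and reload a dict of named tensors with safetensors.
import torch
from safetensors.torch import save_file, load_file

tensors = {"weight": torch.randn(4, 4), "bias": torch.zeros(4)}
save_file(tensors, "model.safetensors")    # write named tensors to disk

restored = load_file("model.safetensors")  # safe load, no arbitrary code runs
print(restored["weight"].shape)            # torch.Size([4, 4])
```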