huggingface / optimum
🚀 Accelerate inference and training of 🤗 Transformers, Diffusers, TIMM and Sentence Transformers with easy-to-use hardware optimization tools
⭐ 3,225 · Updated last week
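To give a concrete sense of what optimum provides, here is a minimal sketch of exporting a Transformers checkpoint to ONNX Runtime through optimum and running it with the familiar pipeline API; the checkpoint name and task are illustrative assumptions, not part of this listing.

```python
# Minimal sketch (assumed example checkpoint and task): export a Transformers model
# to ONNX Runtime via optimum and run it through the standard pipeline API.
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # assumed example model

# export=True converts the PyTorch weights to ONNX at load time
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("Optimum makes ONNX Runtime inference straightforward."))
```

This from_pretrained-and-pipeline pattern is the kind of workflow the quantization and inference libraries listed below are typically compared against.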
Alternatives and similar repositories for optimum
Users interested in optimum are comparing it to the libraries listed below.
- MII makes low-latency and high-throughput inference possible, powered by DeepSpeed. ⭐ 2,085 · Updated 5 months ago
- Accessible large language models via k-bit quantization for PyTorch (a 4-bit loading sketch follows this list). ⭐ 7,855 · Updated 2 weeks ago
- 🤗 Evaluate: A library for easily evaluating machine learning models and datasets. ⭐ 2,386 · Updated last month
- Simple, safe way to store and distribute tensors. ⭐ 3,568 · Updated last week
- 🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (i… ⭐ 9,411 · Updated last week
- Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀 ⭐ 1,688 · Updated last year
- ⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Pl… ⭐ 2,169 · Updated last year
- A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on H… ⭐ 3,027 · Updated last week
- SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model compression techniques on PyTorch, TensorFlow, … ⭐ 2,558 · Updated this week
- PyTorch extensions for high performance and large scale training. ⭐ 3,392 · Updated 8 months ago
- AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. ⭐ 2,300 · Updated 7 months ago
- Transformer-related optimization, including BERT, GPT. ⭐ 6,377 · Updated last year
- [MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. ⭐ 3,399 · Updated 5 months ago
- Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackab… ⭐ 1,584 · Updated last year
- PyTorch-native quantization and sparsity for training and inference. ⭐ 2,591 · Updated this week
- Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers". ⭐ 2,241 · Updated last year
- Sparsity-aware deep learning inference runtime for CPUs. ⭐ 3,159 · Updated 6 months ago
- PyTorch-native post-training library. ⭐ 5,629 · Updated this week
- An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm. ⭐ 5,014 · Updated 8 months ago
- Minimalistic large language model 3D-parallelism training. ⭐ 2,381 · Updated 2 weeks ago
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM. ⭐ 2,489 · Updated last week
- A PyTorch quantization backend for optimum. ⭐ 1,018 · Updated last month
- Hackable and optimized Transformers building blocks, supporting a composable construction. ⭐ 10,228 · Updated this week
- Python bindings for the Transformer models implemented in C/C++ using the GGML library. ⭐ 1,878 · Updated last year
- Freeing data processing from scripting madness by providing a set of platform-agnostic, customizable pipeline processing blocks. ⭐ 2,788 · Updated this week
- Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads. ⭐ 2,680 · Updated last year
- Fast inference engine for Transformer models. ⭐ 4,214 · Updated last week
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters. ⭐ 1,882 · Updated last year
- PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments. ⭐ 833 · Updated 4 months ago
- A Python package that extends official PyTorch to easily obtain performance gains on Intel platforms. ⭐ 2,001 · Updated this week
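As referenced in the k-bit quantization entry above, below is a minimal sketch of what 4-bit loading along that route typically looks like when driven from transformers with a BitsAndBytesConfig; the checkpoint name, compute dtype and quant type are assumptions chosen for illustration.

```python
# Minimal sketch (assumed example checkpoint and settings): load a causal LM with
# 4-bit k-bit quantization through transformers' BitsAndBytesConfig.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/opt-1.3b"  # assumed example model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize linear layers to 4 bits at load time
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type (QLoRA-style)
    bnb_4bit_compute_dtype=torch.bfloat16,  # run the matmuls in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("4-bit inference in a few lines:", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0], skip_special_tokens=True))
```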