huggingface / optimum
🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy-to-use hardware optimization tools
⭐ 2,576 · Updated this week
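As a rough illustration of what these hardware optimization tools look like in practice, below is a minimal sketch of accelerating inference with optimum's ONNX Runtime backend. The checkpoint name and pipeline task are illustrative placeholders, and the `export=True` flag assumes a recent optimum release with the `optimum[onnxruntime]` extra installed; none of this is specified in the listing itself.

```python
# Minimal sketch (assumptions noted above): export a 🤗 Transformers checkpoint
# to ONNX and run it with ONNX Runtime through optimum's drop-in model classes.
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

# Placeholder checkpoint, chosen only for illustration.
model_id = "distilbert-base-uncased-finetuned-sst-2-english"

# export=True converts the PyTorch weights to ONNX on the fly (recent optimum versions).
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The ORT model plugs into the familiar transformers pipeline API unchanged.
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("Hardware-optimized inference behind the same high-level API."))
```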
Related projects
Alternatives and complementary repositories for optimum
- MII makes low-latency and high-throughput inference possible, powered by DeepSpeed. ⭐ 1,904 · Updated this week
- Accessible large language models via k-bit quantization for PyTorch. ⭐ 6,299 · Updated this week
- Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers". ⭐ 1,941 · Updated 7 months ago
- Transformer-related optimization, including BERT and GPT. ⭐ 5,890 · Updated 7 months ago
- 🤗 Evaluate: A library for easily evaluating machine learning models and datasets. ⭐ 2,037 · Updated 2 months ago
- A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs… ⭐ 1,979 · Updated this week
- PyTorch extensions for high-performance and large-scale training. ⭐ 3,195 · Updated last week
- ⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Pl… ⭐ 2,138 · Updated last month
- [MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. ⭐ 2,526 · Updated last month
- Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackab… ⭐ 1,535 · Updated 9 months ago
- Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models. ⭐ 1,659 · Updated 3 weeks ago
- AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. ⭐ 1,765 · Updated this week
- 🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (i… ⭐ 7,958 · Updated this week
- SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX R… ⭐ 2,227 · Updated this week
- Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs. ⭐ 2,205 · Updated this week
- Sparsity-aware deep learning inference runtime for CPUs. ⭐ 3,028 · Updated 4 months ago
- Fast inference engine for Transformer models. ⭐ 3,411 · Updated this week
- An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm. ⭐ 4,497 · Updated last month
- Simple, safe way to store and distribute tensors. ⭐ 2,900 · Updated 2 weeks ago
- Python bindings for the Transformer models implemented in C/C++ using the GGML library. ⭐ 1,814 · Updated 9 months ago
- SGLang is a fast serving framework for large language models and vision language models. ⭐ 6,127 · Updated this week
- Ongoing research training transformer language models at scale, including BERT & GPT-2. ⭐ 1,893 · Updated last month
- A PyTorch quantization backend for Optimum. ⭐ 824 · Updated last week
- Training and serving large-scale neural networks with auto parallelization. ⭐ 3,077 · Updated 11 months ago
- The hub for EleutherAI's work on interpretability and learning dynamics. ⭐ 2,282 · Updated 2 weeks ago
- A framework for few-shot evaluation of language models. ⭐ 6,990 · Updated this week
- Fast and memory-efficient exact attention. ⭐ 14,279 · Updated this week
- Large language models (LLMs) made easy; EasyLM is a one-stop solution for pre-training, finetuning, evaluating, and serving LLMs in JAX/Fl… ⭐ 2,409 · Updated 3 months ago