microsoft / onnxruntime-genai
Generative AI extensions for onnxruntime
☆728 · Updated this week
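Since the package's tagline is terse, here is a minimal text-generation sketch against the Python bindings, assuming a model already converted to the GenAI format. The model path and prompt are placeholders, and method names such as `append_tokens` have shifted between releases, so treat this as an illustrative outline rather than canonical usage.

```python
# Minimal sketch, assuming the onnxruntime-genai Python package and a model
# directory already exported to the GenAI format; API details vary by release.
import onnxruntime_genai as og

model = og.Model("path/to/exported-model")   # placeholder model directory
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=128)    # cap generation at 128 tokens

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("What is ONNX Runtime?"))

# Token-by-token generation loop
while not generator.is_done():
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))
```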
Alternatives and similar repositories for onnxruntime-genai
Users interested in onnxruntime-genai are comparing it to the libraries listed below.
- onnxruntime-extensions: A specialized pre- and post-processing library for ONNX Runtime ☆390 · Updated this week
- Run Generative AI models with a simple C++/Python API using the OpenVINO Runtime ☆282 · Updated this week
- 🤗 Optimum Intel: Accelerate inference with Intel optimization tools (see the usage sketch after this list) ☆467 · Updated this week
- Olive: Simplify ML Model Finetuning, Conversion, Quantization, and Optimization for CPUs, GPUs, and NPUs. ☆1,944 · Updated this week
- ONNX Script enables developers to naturally author ONNX functions and models using a subset of Python. ☆356 · Updated this week
- Examples for using ONNX Runtime for model training. ☆338 · Updated 7 months ago
- Universal cross-platform tokenizers binding to HF and sentencepiece ☆342 · Updated this week
- An innovative library for efficient LLM inference via low-bit quantization ☆348 · Updated 9 months ago
- Production-ready LLM model compression/quantization toolkit with HW-accelerated inference support for both CPU/GPU via HF, vLLM, and SGLa… ☆590 · Updated last week
- Low-bit LLM inference on CPU with lookup table ☆793 · Updated last week
- nvidia-modelopt is a unified library of state-of-the-art model optimization techniques like quantization, pruning, distillation, speculat… ☆956 · Updated 2 weeks ago
- The Triton TensorRT-LLM Backend ☆845 · Updated this week
- Intel® NPU Acceleration Library ☆677 · Updated last month
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆1,441 · Updated this week
- Official implementation of Half-Quadratic Quantization (HQQ) ☆818 · Updated this week
- ⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Pl… ☆2,170 · Updated 7 months ago
- No-code CLI designed for accelerating ONNX workflows ☆192 · Updated 2 weeks ago
- Easy and lightning-fast training of 🤗 Transformers on the Habana Gaudi processor (HPU) ☆186 · Updated this week
- The Qualcomm® AI Hub Models are a collection of state-of-the-art machine learning models optimized for performance (latency, memory, etc.)… ☆706 · Updated this week
- LLaMa/RWKV ONNX models, quantization, and test cases ☆363 · Updated last year
- Common utilities for ONNX converters ☆270 · Updated 6 months ago
- Advanced Quantization Algorithm for LLMs and VLMs, with support for CPU, Intel GPU, CUDA and HPU. Seamlessly integrated with Torchao, Tra… ☆483 · Updated this week
- AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference (see the quantization sketch after this list). Documentation: ☆2,176 · Updated 3 weeks ago
- LiteRT is the new name for TensorFlow Lite (TFLite). While the name is new, it's still the same trusted, high-performance runtime for on-… ☆469 · Updated this week
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens ☆831 · Updated 9 months ago
- SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX R… ☆2,419 · Updated this week
- Supporting PyTorch models with the Google AI Edge TFLite runtime. ☆620 · Updated this week
- ☆1,025 · Updated last year
- A Python package that extends official PyTorch to easily obtain performance gains on Intel platforms ☆1,866 · Updated this week
- llm-export can export LLM models to ONNX. ☆293 · Updated 4 months ago
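The usage sketch referenced in the 🤗 Optimum Intel entry above: Optimum Intel wraps transformers model classes with OpenVINO-backed equivalents. The model id and generation settings below are illustrative, and `export=True` (on-the-fly conversion to OpenVINO IR) assumes a recent optimum-intel release.

```python
# Minimal sketch, assuming optimum-intel with the OpenVINO extra installed
# (pip install "optimum[openvino]"); the model id is illustrative.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "gpt2"
# export=True converts the PyTorch checkpoint to OpenVINO IR on the fly
model = OVModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("ONNX Runtime is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```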
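And the quantization sketch referenced in the AutoAWQ entry: a minimal 4-bit AWQ pass. The model id and output directory are illustrative, and the `quant_config` keys follow AutoAWQ's documented defaults but may differ across versions.

```python
# Minimal sketch, assuming the autoawq package and a CUDA-capable GPU;
# the model id and output directory are placeholders.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "facebook/opt-125m"   # illustrative small model
quant_path = "opt-125m-awq"        # hypothetical output directory

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# 4-bit AWQ quantization with group size 128, using the GEMM kernels
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```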