microsoft / onnxruntime-genai
Generative AI extensions for onnxruntime
☆649 · Updated last week
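For orientation, here is a minimal text-generation sketch against the onnxruntime-genai Python bindings. It assumes a model already exported to the library's ONNX format under a placeholder path `model_dir`; the method names follow recent releases and may differ between versions, so treat this as a sketch rather than a definitive usage pattern.

```python
# Minimal greedy-generation sketch using onnxruntime-genai's Python API.
# "model_dir" is a placeholder for a genai-exported ONNX model folder;
# exact method names vary across releases.
import onnxruntime_genai as og

model = og.Model("model_dir")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=128)   # cap the total sequence length

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("What is ONNX Runtime?"))

# Decode one token at a time until EOS or max_length is reached.
while not generator.is_done():
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))
```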
Alternatives and similar repositories for onnxruntime-genai:
Users interested in onnxruntime-genai are comparing it to the libraries listed below.
- Run Generative AI models with a simple C++/Python API using OpenVINO Runtime ☆236 · Updated this week
- Olive: Simplify ML Model Finetuning, Conversion, Quantization, and Optimization for CPUs, GPUs and NPUs. ☆1,828 · Updated this week
- onnxruntime-extensions: A specialized pre- and post-processing library for ONNX Runtime ☆366 · Updated this week
- An innovative library for efficient LLM inference via low-bit quantization ☆351 · Updated 6 months ago
- ONNX Script enables developers to naturally author ONNX functions and models using a subset of Python. ☆323 · Updated this week
- Examples for using ONNX Runtime for model training. ☆329 · Updated 4 months ago
- Official implementation of Half-Quadratic Quantization (HQQ) ☆765 · Updated this week
- 🤗 Optimum Intel: Accelerate inference with Intel optimization tools ☆449 · Updated this week
- Low-bit LLM inference on CPU with lookup table ☆702 · Updated 2 months ago
- A unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, speculative decoding, et… ☆808 · Updated this week
- LLaMa/RWKV ONNX models, quantization and test cases ☆359 · Updated last year
- Universal cross-platform tokenizers binding to HF and sentencepiece ☆311 · Updated 3 weeks ago
- The Triton TensorRT-LLM Backend ☆806 · Updated last week
- A PyTorch quantization backend for Optimum ☆900 · Updated 2 weeks ago
- ⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Pl… ☆2,161 · Updated 5 months ago
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆1,103 · Updated this week
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆771 · Updated 6 months ago
- ☆1,025 · Updated last year
- The Qualcomm® AI Hub Models are a collection of state-of-the-art machine learning models optimized for performance (latency, memory etc.)… ☆639 · Updated this week
- Common utilities for ONNX converters ☆259 · Updated 3 months ago
- Advanced Quantization Algorithm for LLMs/VLMs. ☆394 · Updated this week
- ☆937 · Updated last month
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆262 · Updated 5 months ago
- Intel® NPU Acceleration Library ☆643 · Updated 2 months ago
- A throughput-oriented high-performance serving framework for LLMs ☆766 · Updated 6 months ago
- For releasing code related to compression methods for transformers, accompanying our publications ☆416 · Updated 2 months ago
- Production-ready LLM model compression/quantization toolkit with hardware-accelerated inference support for both CPU/GPU via HF, vLLM, and SGLa… ☆375 · Updated this week
- VPTQ, a flexible and extreme low-bit quantization algorithm ☆612 · Updated this week
- Examples for using ONNX Runtime for machine learning inferencing (see the inference sketch after this list). ☆1,332 · Updated last month
- Pure C++ implementation of several models for real-time chatting on your computer (CPU & GPU) ☆550 · Updated this week
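To complement the ONNX Runtime examples repository listed above, the sketch below runs a plain ONNX model with the standard onnxruntime Python API; `model.onnx` and the image-shaped dummy input are placeholders for your own exported model.

```python
# Minimal ONNX Runtime inference sketch (standard onnxruntime Python API).
# "model.onnx" and the (1, 3, 224, 224) input shape are placeholders.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name               # discover the model's input name
x = np.random.rand(1, 3, 224, 224).astype(np.float32)   # dummy image-shaped batch
outputs = session.run(None, {input_name: x})            # None -> return every output
print(outputs[0].shape)
```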