microsoft / onnxruntime-genai
Generative AI extensions for onnxruntime
☆612 · Updated this week
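A minimal text-generation sketch using onnxruntime-genai's Python API, based on the project's published examples. The API surface has changed across releases, so names like `set_search_options` and `compute_logits` are assumptions tied to early releases; check the current docs before relying on them.

```python
# Minimal greedy/sampled generation loop with onnxruntime-genai.
# Assumes a model folder exported for GenAI (genai_config.json + ONNX weights).
import onnxruntime_genai as og

model = og.Model("path/to/exported-model")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=128)
params.input_ids = tokenizer.encode("What is ONNX Runtime?")

generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()        # run one decode step
    generator.generate_next_token()   # pick the next token per the search options

print(tokenizer.decode(generator.get_sequence(0)))
```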
Alternatives and similar repositories for onnxruntime-genai:
Users interested in onnxruntime-genai are comparing it to the libraries listed below.
- Examples for using ONNX Runtime for model training. ☆325 · Updated 3 months ago
- ONNX Script enables developers to naturally author ONNX functions and models using a subset of Python. ☆316 · Updated this week
- onnxruntime-extensions: A specialized pre- and post-processing library for ONNX Runtime ☆356 · Updated this week
- Olive: Simplify ML Model Finetuning, Conversion, Quantization, and Optimization for CPUs, GPUs and NPUs. ☆1,754 · Updated this week
- Official implementation of Half-Quadratic Quantization (HQQ) ☆747 · Updated this week
- 🤗 Optimum Intel: Accelerate inference with Intel optimization tools ☆442 · Updated this week
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆708 · Updated 5 months ago
- An innovative library for efficient LLM inference via low-bit quantization ☆352 · Updated 5 months ago
- TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation… ☆706 · Updated this week
- LLaMa/RWKV onnx models, quantization and testcase ☆356 · Updated last year
- The Triton TensorRT-LLM Backend ☆774 · Updated this week
- Run Generative AI models with a simple C++/Python API using the OpenVINO Runtime ☆217 · Updated this week
- A PyTorch quantization backend for Optimum ☆878 · Updated last month
- Universal cross-platform tokenizers binding to HF and sentencepiece ☆303 · Updated last week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆259 · Updated 4 months ago
- Common utilities for ONNX converters ☆257 · Updated 2 months ago
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆972 · Updated this week
- ⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms ☆2,160 · Updated 4 months ago
- ☆1,023 · Updated last year
- Inference Vision Transformer (ViT) in plain C/C++ with ggml ☆253 · Updated 10 months ago
- SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime ☆2,319 · Updated this week
- ☆522 · Updated 3 months ago
- AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference (see the quantization sketch after this list) ☆1,945 · Updated 3 weeks ago
- Advanced Quantization Algorithm for LLMs/VLMs. ☆367 · Updated this week
- OpenAI-compatible API for the TensorRT-LLM Triton backend ☆191 · Updated 6 months ago
- Examples for using ONNX Runtime for machine learning inferencing. ☆1,293 · Updated 3 weeks ago
- Production-ready LLM model compression/quantization toolkit with accelerated inference support for both CPU/GPU via HF, vLLM, and SGLang. ☆274 · Updated this week
- A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, with easy export to ONNX/ONNX Runtime. ☆158 · Updated this week
- Low-bit LLM inference on CPU with lookup table ☆670 · Updated last month
- For releasing code related to compression methods for transformers, accompanying our publications ☆405 · Updated 3 weeks ago
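Most of the quantization-focused entries above (AutoAWQ, GPTQModel, HQQ, Neural Compressor) share the same core arithmetic: map float weights onto a small integer grid with a scale and zero-point. Below is a minimal numpy sketch of that round-to-nearest INT4 scheme, written purely for illustration rather than taken from any listed library; real methods layer activation-aware scaling (AWQ) or error-compensating updates (GPTQ) on top of this.

```python
# Illustrative affine INT4 quantization (not any listed library's API).
import numpy as np

def quantize_int4(w: np.ndarray):
    """Map float weights onto 4-bit integers [0, 15] with a scale and zero-point."""
    qmin, qmax = 0, 15                              # 16 representable levels
    scale = (w.max() - w.min()) / (qmax - qmin)     # float step between grid points
    zero_point = round(-w.min() / scale)            # integer that represents 0.0
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_int4(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

w = np.random.randn(8).astype(np.float32)
q, s, z = quantize_int4(w)
print("max abs error:", np.abs(w - dequantize_int4(q, s, z)).max())
```

In practice the scale and zero-point are computed per output channel or per small group of weights rather than per tensor, which is what keeps the reconstruction error tolerable at 4 bits.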