microsoft / onnxruntime-genai
Generative AI extensions for onnxruntime
☆501 · Updated this week
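As a rough sketch of what the "Generative AI extensions for onnxruntime" provide, the snippet below follows the typical Python usage pattern from the project's examples: load an exported model, tokenize a prompt, set search options, and generate. Exact method names and return types have varied between onnxruntime-genai releases, and the model path is a placeholder, so treat this as an assumption-laden illustration rather than the definitive API.

```python
# Sketch only: method names differ across onnxruntime-genai releases, and
# "path/to/exported-model" is a placeholder for a folder containing an
# ONNX-exported generative model plus its tokenizer files.
import onnxruntime_genai as og

model = og.Model("path/to/exported-model")   # load the exported ONNX model
tokenizer = og.Tokenizer(model)              # tokenizer bundled with the model

prompt = "def print_prime(n):"
tokens = tokenizer.encode(prompt)            # encode the prompt to token ids

params = og.GeneratorParams(model)
params.set_search_options(max_length=200)    # cap the generated sequence length
params.input_ids = tokens

output_tokens = model.generate(params)       # run the full generation loop
print(tokenizer.decode(output_tokens[0]))    # decode the first returned sequence
```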
Related projects
Alternatives and complementary repositories for onnxruntime-genai
- Run Generative AI models with a simple C++/Python API using OpenVINO Runtime ☆144 · Updated this week
- ONNX Script enables developers to naturally author ONNX functions and models using a subset of Python. ☆280 · Updated this week
- TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillati… ☆534 · Updated this week
- 🤗 Optimum Intel: Accelerate inference with Intel optimization tools ☆404 · Updated this week
- Olive: Simplify ML Model Finetuning, Conversion, Quantization, and Optimization for CPUs, GPUs and NPUs. ☆1,594 · Updated this week
- onnxruntime-extensions: A specialized pre- and post-processing library for ONNX Runtime ☆334 · Updated this week
- An innovative library for efficient LLM inference via low-bit quantization ☆348 · Updated 2 months ago
- ☆1,021 · Updated 10 months ago
- LLaMa/RWKV ONNX models, quantization and test cases ☆350 · Updated last year
- Advanced Quantization Algorithm for LLMs. This is the official implementation of "Optimize Weight Rounding via Signed Gradient Descent for t… ☆245 · Updated this week
- Official implementation of Half-Quadratic Quantization (HQQ) ☆698 · Updated last week
- VPTQ, a flexible and extreme low-bit quantization algorithm ☆494 · Updated this week
- Examples for using ONNX Runtime for model training. ☆311 · Updated 2 weeks ago
- Common utilities for ONNX converters ☆251 · Updated 4 months ago
- Easy and lightning fast training of 🤗 Transformers on Habana Gaudi processor (HPU) ☆152 · Updated this week
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆661 · Updated this week
- ⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Pl… ☆2,136 · Updated last month
- FlashInfer: Kernel Library for LLM Serving ☆1,395 · Updated this week
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆611 · Updated 2 months ago
- A PyTorch quantization backend for optimum ☆818 · Updated this week
- ☆501 · Updated last week
- The Triton TensorRT-LLM Backend ☆703 · Updated this week
- For releasing code related to compression methods for transformers, accompanying our publications ☆369 · Updated 3 weeks ago
- A throughput-oriented high-performance serving framework for LLMs ☆629 · Updated last month
- 🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O… ☆251 · Updated this week
- [NeurIPS'24 Spotlight] To speed up long-context LLM inference, compute attention approximately and with dynamic sparsity, which reduces in… ☆776 · Updated this week
- Supporting PyTorch models with the Google AI Edge TFLite runtime. ☆360 · Updated this week
- Intel® NPU Acceleration Library ☆499 · Updated last week
- Low-bit LLM inference on CPU with lookup table ☆559 · Updated last week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆250 · Updated 3 weeks ago