onnx / turnkeyml
Local LLM Server with NPU Acceleration
☆144 Updated this week
Alternatives and similar repositories for turnkeyml:
Users who are interested in turnkeyml compare it with the libraries listed below.
- AI Tensor Engine for ROCm ☆160 Updated this week
- An innovative library for efficient LLM inference via low-bit quantization ☆352 Updated 7 months ago
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs ☆87 Updated this week
- llama.cpp fork with additional SOTA quants and improved performance ☆276 Updated this week
- OpenAI Triton backend for Intel® GPUs ☆178 Updated this week
- ☆118 Updated 11 months ago
- Development repository for the Triton language and compiler ☆118 Updated this week
- hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionalities beyond a traditiona… ☆86 Updated this week
- ☆156 Updated 2 weeks ago
- Model compression for ONNX ☆91 Updated 5 months ago
- BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment. ☆579 Updated 2 months ago
- Use safetensors with ONNX 🤗 ☆50 Updated last month
- ☆207 Updated 2 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆262 Updated 6 months ago
- ☆60 Updated last year
- Repository for the QUIK project, enabling the use of 4bit kernels for generative inference - EMNLP 2024 ☆179 Updated last year
- Fast low-bit matmul kernels in Triton ☆288 Updated this week
- ☆122 Updated 3 weeks ago
- Repository of model demos using TT-Buda ☆63 Updated 2 weeks ago
- Advanced Quantization Algorithm for LLMs/VLMs. ☆427 Updated this week
- ☆105 Updated last week
- Lightweight Inference server for OpenVINO ☆152 Updated this week
- ONNX Script enables developers to naturally author ONNX functions and models using a subset of Python. ☆337 Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆72 Updated this week
- A simple Flash Attention v2 implementation with ROCm (RDNA3 GPU, rocWMMA), mainly used for Stable Diffusion (ComfyUI) in Windows ZLUDA en… ☆37 Updated 7 months ago
- OpenVINO Tokenizers extension ☆32 Updated last week
- AMD's graph optimization engine. ☆214 Updated this week
- ☆250 Updated this week
- This repository contains Dockerfiles, scripts, yaml files, Helm charts, etc. used to scale out AI containers with versions of TensorFlow … ☆42 Updated this week
- High-Performance SGEMM on CUDA devices ☆90 Updated 2 months ago