andreyanufr / who_what_benchmark
⭐20 · Updated last year
Alternatives and similar repositories for who_what_benchmark
Users interested in who_what_benchmark are comparing it to the libraries listed below.
- 🤗 Optimum Intel: Accelerate inference with Intel optimization tools ⭐496 · Updated this week
- Easy and lightning fast training of 🤗 Transformers on Habana Gaudi processor (HPU) ⭐197 · Updated this week
- 🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O… ⭐315 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ⭐266 · Updated 11 months ago
- Neural Network Compression Framework for enhanced OpenVINO™ inference ⭐1,083 · Updated this week
- An innovative library for efficient LLM inference via low-bit quantization ⭐348 · Updated last year
- Easy and Efficient Quantization for Transformers ⭐203 · Updated 3 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ⭐83 · Updated this week
- Neural network model repository for highly sparse and sparse-quantized models with matching sparsification recipes ⭐392 · Updated 3 months ago
- Prune a model while finetuning or training. ⭐405 · Updated 3 years ago
- This repository contains tutorials and examples for Triton Inference Server ⭐774 · Updated last week
- Official implementation of Half-Quadratic Quantization (HQQ) ⭐878 · Updated 2 weeks ago
- Reference models for Intel(R) Gaudi(R) AI Accelerator ⭐166 · Updated this week
- Advanced Quantization Algorithm for LLMs and VLMs, with support for CPU, Intel GPU, CUDA and HPU. ⭐638 · Updated this week
- A pytorch quantization backend for optimum ⭐987 · Updated last month
- The Triton TensorRT-LLM Backend ⭐890 · Updated last week
- SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX R… ⭐2,497 · Updated this week
- Triton Model Analyzer is a CLI tool to help with better understanding of the compute and memory requirements of the Triton Inference Serv… ⭐490 · Updated 2 weeks ago
- Examples for using ONNX Runtime for model training. ⭐348 · Updated 11 months ago
- Explainable AI Tooling (XAI). XAI is used to discover and explain a model's prediction in a way that is interpretable to the user. Releva… ⭐39 · Updated this week
- The Triton backend for the ONNX Runtime. ⭐162 · Updated 2 weeks ago
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ⭐900 · Updated last year
- Run Generative AI models with simple C++/Python API and using OpenVINO Runtime ⭐342 · Updated this week
- Fast sparse deep learning on CPUs ⭐56 · Updated 2 years ago
- [ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization ⭐703 · Updated last year
- A collection of LogitsProcessors to customize and enhance LLM behavior for specific tasks. ⭐358 · Updated 2 months ago
- Code repo for the paper "SpinQuant LLM quantization with learned rotations" ⭐328 · Updated 7 months ago
- A unified library of state-of-the-art model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. … ⭐1,397 · Updated this week
- 🦖 X—LLM: Cutting Edge & Easy LLM Finetuning ⭐407 · Updated last year
- A family of compressed models obtained via pruning and knowledge distillation ⭐352 · Updated 10 months ago