andreyanufr / who_what_benchmark
☆21 · Updated last year
Alternatives and similar repositories for who_what_benchmark
Users interested in who_what_benchmark are comparing it to the libraries listed below.
- 🤗 Optimum Intel: Accelerate inference with Intel optimization tools ☆528 · Updated this week
- 🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O… ☆326 · Updated 3 months ago
- Easy and lightning-fast training of 🤗 Transformers on the Habana Gaudi processor (HPU) ☆204 · Updated last week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆267 · Updated last month
- SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model compression techniques on PyTorch, TensorFlow, … ☆2,570 · Updated this week
- Official implementation of Half-Quadratic Quantization (HQQ) ☆905 · Updated last month
- Neural Network Compression Framework for enhanced OpenVINO™ inference ☆1,115 · Updated last week
- Easy and Efficient Quantization for Transformers ☆202 · Updated 6 months ago
- An innovative library for efficient LLM inference via low-bit quantization ☆351 · Updated last year
- A PyTorch quantization backend for Optimum (see the quantization sketch after this list) ☆1,021 · Updated 2 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆86 · Updated last week
- The Triton TensorRT-LLM Backend ☆914 · Updated last week
- 🎯 An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality degradation across Weight-Only Quantiza… ☆815 · Updated this week
- Prune a model while finetuning or training. ☆405 · Updated 3 years ago
- ⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Pl… ☆2,173 · Updated last year
- Tools for easier OpenVINO development/debugging ☆10 · Updated 6 months ago
- A collection of LogitsProcessors to customize and enhance LLM behavior for specific tasks. ☆382 · Updated 6 months ago
- [ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization ☆712 · Updated last year
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆987 · Updated last year
- 🦖 X—LLM: Cutting Edge & Easy LLM Finetuning ☆407 · Updated 2 years ago
- PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments (see the serving sketch after this list). ☆833 · Updated 5 months ago
- Examples for using ONNX Runtime for model training. ☆358 · Updated last year
- NeurIPS Large Language Model Efficiency Challenge: 1 LLM + 1 GPU + 1 Day ☆259 · Updated 2 years ago
- A unified library of SOTA model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresse… ☆1,848 · Updated this week
- The robust European language model benchmark. ☆150 · Updated last week
- This repository contains tutorials and examples for Triton Inference Server ☆813 · Updated last week
- Triton Model Analyzer is a CLI tool for understanding the compute and memory requirements of the Triton Inference Serv… ☆502 · Updated this week
- ☆324 · Updated last week
- Reference models for Intel(R) Gaudi(R) AI Accelerator ☆169 · Updated last week
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models ☆1,590 · Updated last year
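
Many of the libraries above target the same post-training, weight-only quantization workflow: load a full-precision model, rewrite its linear layers to low-bit weights, then run inference as usual. As a rough illustration, here is a minimal sketch using optimum-quanto (the "PyTorch quantization backend for Optimum" listed above); the model ID and the 4-bit weight setting are assumptions chosen to keep the example small and runnable, not recommendations from any listed project.

```python
# Minimal post-training weight-only quantization sketch with optimum-quanto.
# Assumptions: the model ID and 4-bit setting are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.quanto import quantize, freeze, qint4

model_id = "facebook/opt-125m"  # small model chosen only to keep the example light
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)

# Rewrite eligible linear layers to use 4-bit quantized weights.
quantize(model, weights=qint4)
# freeze() converts the quantized weights to their packed integer representation.
freeze(model)

inputs = tokenizer("Quantization reduces memory by", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

The quantize-then-freeze shape shown here is specific to optimum-quanto, but most of the quantization toolkits in this list follow a similar load/rewrite/infer pattern, differing mainly in the numerics (HQQ, SmoothQuant, SqueezeLLM, etc.) and the supported backends.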
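
On the serving side, the PyTriton entry above describes a Flask/FastAPI-like bind-and-serve pattern. The sketch below shows that pattern under stated assumptions: the model name, tensor names, and the toy inference function are hypothetical placeholders, not taken from any project in this list.

```python
# Minimal sketch of PyTriton's bind-and-serve pattern. The "Doubler" model,
# tensor names, and inference logic are hypothetical placeholders.
import numpy as np
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton

@batch
def infer_fn(inputs):
    # Placeholder inference: return the batch doubled.
    return {"outputs": inputs * 2.0}

with Triton() as triton:
    # bind() registers the Python callable as a Triton model.
    triton.bind(
        model_name="Doubler",  # hypothetical model name
        infer_func=infer_fn,
        inputs=[Tensor(name="inputs", dtype=np.float32, shape=(-1,))],
        outputs=[Tensor(name="outputs", dtype=np.float32, shape=(-1,))],
        config=ModelConfig(max_batch_size=16),
    )
    triton.serve()  # blocks; clients reach the model via Triton's HTTP/gRPC endpoints
```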