andreyanufr / who_what_benchmark
☆20 · Updated last year
Alternatives and similar repositories for who_what_benchmark
Users who are interested in who_what_benchmark are comparing it to the libraries listed below.
- 🤗 Optimum Intel: Accelerate inference with Intel optimization tools (☆502 · Updated this week)
- 🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O… (☆318 · Updated last month)
- Easy and Efficient Quantization for Transformers (☆202 · Updated 4 months ago)
- Easy and lightning fast training of 🤗 Transformers on Habana Gaudi processor (HPU) (☆199 · Updated last week)
- A PyTorch quantization backend for Optimum (☆999 · Updated last week)
- A high-throughput and memory-efficient inference and serving engine for LLMs (☆266 · Updated last year)
- Official implementation of Half-Quadratic Quantization (HQQ) (☆884 · Updated this week)
- Neural Network Compression Framework for enhanced OpenVINO™ inference (☆1,091 · Updated last week)
- An innovative library for efficient LLM inference via low-bit quantization (☆349 · Updated last year)
- 🦖 X—LLM: Cutting Edge & Easy LLM Finetuning (☆406 · Updated last year)
- [ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization (☆704 · Updated last year)
- A collection of LogitsProcessors to customize and enhance LLM behavior for specific tasks. (☆369 · Updated 3 months ago)
- For releasing code related to compression methods for transformers, accompanying our publications (☆447 · Updated 9 months ago)
- 🕹️ Performance Comparison of MLOps Engines, Frameworks, and Languages on Mainstream AI Models. (☆140 · Updated last year)
- A high-throughput and memory-efficient inference and serving engine for LLMs (☆83 · Updated this week)
- Official PyTorch implementation of QA-LoRA (☆141 · Updated last year)
- Advanced Quantization Algorithm for LLMs and VLMs, with support for CPU, Intel GPU, CUDA and HPU. (☆679 · Updated this week)
- Prune a model while finetuning or training. (☆405 · Updated 3 years ago)
- SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX R… (☆2,517 · Updated this week)
- PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments. (☆823 · Updated 2 months ago)
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. (☆922 · Updated last year)
- Neural network model repository for highly sparse and sparse-quantized models with matching sparsification recipes (☆389 · Updated 4 months ago)
- The Triton TensorRT-LLM Backend (☆903 · Updated this week)
- Automated Identification of Redundant Layer Blocks for Pruning in Large Language Models (☆249 · Updated last year)
- ⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Pl… (☆2,164 · Updated last year)
- VPTQ, A Flexible and Extreme low-bit quantization algorithm (☆659 · Updated 6 months ago)
- This repository contains tutorials and examples for Triton Inference Server (☆792 · Updated 2 weeks ago)
- Late Interaction Models Training & Retrieval (☆632 · Updated this week)
- The repository for the code of the UltraFastBERT paper (☆518 · Updated last year)
- LLM Workshop by Sourab Mangrulkar (☆394 · Updated last year)