andreyanufr / who_what_benchmark
⭐20 · Updated last year
Alternatives and similar repositories for who_what_benchmark
Users interested in who_what_benchmark are comparing it to the libraries listed below.
- 🤗 Optimum Intel: Accelerate inference with Intel optimization tools ⭐477 · Updated this week
- 🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O… ⭐305 · Updated last month
- An innovative library for efficient LLM inference via low-bit quantization ⭐349 · Updated 10 months ago
- Easy and lightning fast training of 🤗 Transformers on Habana Gaudi processor (HPU) ⭐190 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ⭐77 · Updated this week
- Advanced Quantization Algorithm for LLMs and VLMs, with support for CPU, Intel GPU, CUDA and HPU. Seamlessly integrated with Torchao, Tra… ⭐528 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ⭐264 · Updated 9 months ago
- Official implementation of Half-Quadratic Quantization (HQQ) ⭐845 · Updated last week
- SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX R… ⭐2,452 · Updated this week
- Neural Network Compression Framework for enhanced OpenVINO™ inference ⭐1,056 · Updated last week
- Tools for easier OpenVINO development/debugging ⭐9 · Updated 4 months ago
- A pytorch quantization backend for optimum ⭐963 · Updated last week
- Reference models for Intel(R) Gaudi(R) AI Accelerator ⭐166 · Updated last week
- Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs ⭐405 · Updated this week
- Easy and Efficient Quantization for Transformers ⭐198 · Updated 3 weeks ago
- A collection of LogitsProcessors to customize and enhance LLM behavior for specific tasks. ⭐315 · Updated last week
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ⭐1,640 · Updated this week
- For releasing code related to compression methods for transformers, accompanying our publications ⭐435 · Updated 6 months ago
- Prune a model while finetuning or training. ⭐403 · Updated 3 years ago
- Code repo for the paper "SpinQuant: LLM quantization with learned rotations" ⭐302 · Updated 5 months ago
- [ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization ⭐695 · Updated 11 months ago
- ⭐274 · Updated last month
- LLM Workshop by Sourab Mangrulkar ⭐387 · Updated last year
- The Triton TensorRT-LLM Backend ⭐863 · Updated last week
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ⭐717 · Updated 4 months ago
- ⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Pl… ⭐2,169 · Updated 9 months ago
- The repository for the code of the UltraFastBERT paper ⭐516 · Updated last year
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ⭐855 · Updated 10 months ago
- Official PyTorch implementation of QA-LoRA ⭐138 · Updated last year
- This repository contains tutorials and examples for Triton Inference Server ⭐735 · Updated last month