huggingface / inference-benchmarker
Inference server benchmarking tool
☆38 · Updated this week
Alternatives and similar repositories for inference-benchmarker:
Users interested in inference-benchmarker also compare it to the repositories listed below.
- Google TPU optimizations for transformers models ☆104 · Updated 2 months ago
- Experiments with inference on Llama ☆104 · Updated 9 months ago
- Benchmark suite for LLMs from Fireworks.ai ☆70 · Updated last month
- 🕹️ Performance Comparison of MLOps Engines, Frameworks, and Languages on Mainstream AI Models ☆135 · Updated 8 months ago
- Manage scalable open LLM inference endpoints in Slurm clusters ☆253 · Updated 8 months ago
- The Batched API provides a flexible and efficient way to process multiple requests in a batch, with a primary focus on dynamic batching o… ☆127 · Updated 3 months ago
- ☆112 · Updated 6 months ago
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients ☆195 · Updated 8 months ago
- Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs ☆236 · Updated this week
- IBM development fork of https://github.com/huggingface/text-generation-inference ☆60 · Updated 3 months ago
- ☆176 · Updated this week
- Accelerating your LLM training to full speed! Made with ❤️ by ServiceNow Research ☆151 · Updated this week
- ArcticTraining is a framework designed to simplify and accelerate the post-training process for large language models (LLMs) ☆54 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆262 · Updated 5 months ago
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs ☆87 · Updated this week
- Load compute kernels from the Hub ☆107 · Updated this week
- ☆199 · Updated last year
- Comprehensive analysis of the differences in performance between QLoRA, LoRA, and full finetunes ☆82 · Updated last year
- This repository contains the code for dataset curation and finetuning of the instruct variant of the bilingual OpenHathi model. The resultin… ☆23 · Updated last year
- Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of… ☆124 · Updated 7 months ago
- A Python wrapper around HuggingFace's TGI (text-generation-inference) and TEI (text-embedding-inference) servers ☆34 · Updated 3 months ago
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024 ☆279 · Updated last month
- Set of scripts to finetune LLMs ☆37 · Updated last year
- Repo hosting code and materials related to speeding up LLM inference using token merging ☆35 · Updated 11 months ago
- ☆66 · Updated 10 months ago
- Code for the paper "ROUTERBENCH: A Benchmark for Multi-LLM Routing System" ☆111 · Updated 9 months ago
- LLM KV cache compression made easy ☆444 · Updated 2 weeks ago
- A collection of all available inference solutions for LLMs ☆82 · Updated last month
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ☆234 · Updated this week
- A stable, fast, and easy-to-use inference library with a focus on a sync-to-async API ☆45 · Updated 6 months ago