fw-ai / benchmark
Benchmark suite for LLMs from Fireworks.ai
☆76 · Updated 2 weeks ago
Alternatives and similar repositories for benchmark
Users interested in benchmark are comparing it to the libraries listed below.
- ☆54 · Updated 7 months ago
- LLM Serving Performance Evaluation Harness ☆78 · Updated 3 months ago
- ☆55 · Updated 9 months ago
- ☆155 · Updated this week
- KV cache compression for high-throughput LLM inference ☆130 · Updated 4 months ago
- ArcticTraining is a framework designed to simplify and accelerate the post-training process for large language models (LLMs) ☆119 · Updated this week
- ☆194 · Updated last month
- Inference server benchmarking tool ☆73 · Updated last month
- ☆119 · Updated last year
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆116 · Updated 6 months ago
- Easy and Efficient Quantization for Transformers ☆199 · Updated 4 months ago
- ☆118 · Updated last year
- Experiments on speculative sampling with Llama models ☆128 · Updated 2 years ago
- vLLM performance dashboard ☆30 · Updated last year
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆264 · Updated 8 months ago
- ☆267 · Updated last week
- Experiments with inference on Llama ☆104 · Updated last year
- vLLM adapter for a TGIS-compatible gRPC server ☆32 · Updated this week
- Boosting 4-bit inference kernels with 2:4 sparsity ☆79 · Updated 9 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆126 · Updated this week
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs ☆87 · Updated last week
- Code for the paper "ROUTERBENCH: A Benchmark for Multi-LLM Routing System" ☆122 · Updated last year
- ☆212 · Updated 4 months ago
- IBM development fork of https://github.com/huggingface/text-generation-inference ☆60 · Updated last month
- Implementation of speculative sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by DeepMind ☆98 · Updated last year
- The driver for LMCache core to run in vLLM ☆41 · Updated 4 months ago
- Repo hosting code and materials related to speeding up LLM inference using token merging ☆36 · Updated last year
- Lightweight toolkit package to train and fine-tune 1.58-bit language models ☆78 · Updated last month
- Example of applying CUDA graphs to LLaMA-v2 ☆12 · Updated last year
- Data preparation code for the Amber 7B LLM ☆91 · Updated last year