fw-ai / benchmark
Benchmark suite for LLMs from Fireworks.ai
☆70 · Updated 2 months ago
Alternatives and similar repositories for benchmark:
Users interested in benchmark are comparing it to the libraries listed below.
- ☆49 · Updated 4 months ago
- ☆54 · Updated 6 months ago
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆112 · Updated 4 months ago
- The driver for LMCache core to run in vLLM ☆36 · Updated 2 months ago
- KV cache compression for high-throughput LLM inference ☆125 · Updated 2 months ago
- LLM Serving Performance Evaluation Harness ☆75 · Updated last month
- ☆185 · Updated 6 months ago
- ☆117 · Updated last year
- ArcticTraining is a framework designed to simplify and accelerate the post-training process for large language models (LLMs) ☆64 · Updated this week
- Experiments with inference on Llama ☆104 · Updated 10 months ago
- Experiments on speculative sampling with Llama models ☆125 · Updated last year
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) ☆97 · Updated 3 weeks ago
- vLLM performance dashboard ☆26 · Updated 11 months ago
- ☆241 · Updated this week
- Simple implementation of Speculative Sampling in NumPy for GPT-2 ☆93 · Updated last year
- A toolkit for fine-tuning, inferencing, and evaluating GreenBitAI's LLMs ☆82 · Updated last month
- IBM development fork of https://github.com/huggingface/text-generation-inference ☆60 · Updated 3 months ago
- Perplexity GPU Kernels ☆185 · Updated last week
- vLLM adapter for a TGIS-compatible gRPC server ☆25 · Updated this week
- A low-latency & high-throughput serving engine for LLMs ☆337 · Updated 2 months ago
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models ☆11 · Updated last year
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs ☆87 · Updated this week
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆72 · Updated 7 months ago
- A simple service that integrates vLLM with Ray Serve for fast and scalable LLM serving ☆65 · Updated last year
- Inference server benchmarking tool ☆48 · Updated last week
- ☆205 · Updated 2 months ago
- Efficiently tune any LLM from HuggingFace using distributed training (multiple GPUs) and DeepSpeed. Uses Ray AIR to orchestrate the … ☆56 · Updated last year
- ☆117 · Updated 11 months ago
- Example of applying CUDA graphs to LLaMA-v2 ☆12 · Updated last year
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆262 · Updated 6 months ago
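Several entries above center on speculative decoding (the NumPy Speculative Sampling demo, the Llama speculative-sampling experiments, Ouroboros). As orientation, here is a minimal, hypothetical NumPy sketch of the standard accept/reject verification step; it is a generic illustration of the algorithm, not code from any of the listed repositories, and the function name and shapes are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_verify(draft_probs, target_probs, drafted):
    """Verify a block of draft-model tokens against the target model.

    draft_probs, target_probs: arrays of shape (gamma, vocab) holding the
    per-position token distributions of the draft and target models.
    drafted: the gamma token ids sampled from the draft model.
    Returns (tokens, all_accepted): the accepted tokens plus, on the first
    rejection, one token resampled from the residual distribution.
    (Names and shapes here are illustrative assumptions.)
    """
    out = []
    for i, tok in enumerate(drafted):
        p_t, p_d = target_probs[i][tok], draft_probs[i][tok]
        if rng.random() < min(1.0, p_t / p_d):
            out.append(int(tok))           # accept drafted token w.p. min(1, p_t/p_d)
            continue
        # On rejection, resample from the normalized residual max(0, p_t - p_d).
        residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
        residual /= residual.sum()
        out.append(int(rng.choice(len(residual), p=residual)))
        return out, False                  # stop at the first rejection
    return out, True
```

On acceptance of all gamma tokens the target model's next-token distribution supplies one extra free token, which is where the throughput gain of speculative decoding comes from.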