run-ai / llmperf
☆53 · Updated 8 months ago
Alternatives and similar repositories for llmperf
Users interested in llmperf are comparing it to the libraries listed below.
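For context, harnesses like llmperf typically drive an OpenAI-compatible endpoint under load and record time-to-first-token (TTFT) and decode throughput. Below is a minimal, illustrative sketch of that measurement loop; the endpoint URL, model name, and the one-token-per-stream-chunk approximation are assumptions for illustration, not llmperf's actual API.

```python
import json
import time

import requests

# Placeholder endpoint and model name -- assumptions, not llmperf's API.
ENDPOINT = "http://localhost:8000/v1/completions"  # assumed OpenAI-compatible server
MODEL = "my-model"  # hypothetical model id


def measure_once(prompt: str, max_tokens: int = 128) -> dict:
    """Stream one completion, recording TTFT and decode throughput."""
    payload = {"model": MODEL, "prompt": prompt,
               "max_tokens": max_tokens, "stream": True}
    start = time.perf_counter()
    ttft = None   # time to first token, in seconds
    tokens = 0
    with requests.post(ENDPOINT, json=payload, stream=True, timeout=120) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            # OpenAI-style servers stream SSE lines of the form "data: {...}".
            if not line or not line.startswith(b"data: "):
                continue
            data = line[len(b"data: "):]
            if data == b"[DONE]":
                break
            chunk = json.loads(data)
            if chunk["choices"][0].get("text"):
                if ttft is None:
                    ttft = time.perf_counter() - start
                tokens += 1  # rough: treat each streamed chunk as one token
    total = time.perf_counter() - start
    decode_tps = tokens / max(total - ttft, 1e-9) if ttft is not None else 0.0
    return {"ttft_s": ttft, "tokens": tokens, "decode_tok_per_s": decode_tps}


print(measure_once("Explain KV caching in one paragraph."))
```

Real harnesses layer concurrency, warmup, and percentile reporting (p50/p99 TTFT, inter-token latency) on top of a loop like this.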
- Pretrain, finetune and serve LLMs on Intel platforms with Ray ☆127 · Updated last month
- Benchmark suite for LLMs from Fireworks.ai ☆75 · Updated 2 weeks ago
- LLM Serving Performance Evaluation Harness ☆78 · Updated 3 months ago
- ☆99 · Updated this week
- ☆118 · Updated last year
- ☆260 · Updated 2 weeks ago
- ☆52 · Updated 6 months ago
- A low-latency & high-throughput serving engine for LLMs ☆370 · Updated this week
- ☆193 · Updated 3 weeks ago
- Perplexity GPU Kernels ☆324 · Updated 2 weeks ago
- ☆85 · Updated 2 months ago
- Easy and Efficient Quantization for Transformers ☆198 · Updated 3 months ago
- The driver for LMCache core to run in vLLM ☆41 · Updated 4 months ago
- Intel® Extension for DeepSpeed* is an extension to DeepSpeed that brings feature support with SYCL kernels on Intel GPU (XPU) device. Note… ☆61 · Updated 2 months ago
- ☆71 · Updated 2 months ago
- Applied AI experiments and examples for PyTorch ☆271 · Updated this week
- ArcticTraining is a framework designed to simplify and accelerate the post-training process for large language models (LLMs) ☆105 · Updated this week
- Comparison of Language Model Inference Engines ☆217 · Updated 5 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆263 · Updated 7 months ago
- Stateful LLM Serving ☆70 · Updated 2 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆75 · Updated 8 months ago
- Modular and structured prompt caching for low-latency LLM inference (see the sketch of the prefix-caching idea after this list) ☆94 · Updated 6 months ago
- PyTorch distributed training acceleration framework ☆49 · Updated 3 months ago
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity ☆211 · Updated last year
- Efficiently tune any LLM from Hugging Face using distributed (multi-GPU) training and DeepSpeed; uses Ray AIR to orchestrate the … ☆58 · Updated last year
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components. ☆196 · Updated this week
- A large-scale simulation framework for LLM inference ☆380 · Updated 6 months ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5) ☆251 · Updated 7 months ago
- A minimal implementation of vllm. ☆41 · Updated 10 months ago
- Fast and memory-efficient exact attention ☆72 · Updated last month
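Several entries above, notably the modular prompt-caching project, revolve around reusing computation for requests that share a prompt prefix. The sketch below illustrates the general idea only, with hypothetical names and a placeholder for the cached state; in a real engine the cached value would be the KV cache produced during prefill, and nothing here reflects any listed repo's actual API.

```python
import hashlib
from typing import Optional


class PrefixCache:
    """Toy prefix cache mapping a hash of a prompt prefix to precomputed state."""

    def __init__(self) -> None:
        self._store: dict[str, object] = {}

    @staticmethod
    def _key(prefix: str) -> str:
        return hashlib.sha256(prefix.encode("utf-8")).hexdigest()

    def put(self, prefix: str, state: object) -> None:
        self._store[self._key(prefix)] = state

    def get(self, prefix: str) -> Optional[object]:
        return self._store.get(self._key(prefix))


# Usage: pay the prefill cost for a shared system prompt once, then reuse
# the cached state for every request that starts with the same prefix.
cache = PrefixCache()
system_prompt = "You are a helpful assistant."
cache.put(system_prompt, {"kv": "...precomputed prefill state..."})  # placeholder
assert cache.get(system_prompt) is not None
```

In real serving systems the hard part is making the cached KV blocks sharable and evictable (e.g., paged or radix-tree-indexed storage), not the lookup itself.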