argonne-lcf / LLM-Inference-Bench
☆45 · Updated 2 weeks ago
Alternatives and similar repositories for LLM-Inference-Bench
Users interested in LLM-Inference-Bench are comparing it to the libraries listed below.
- Stateful LLM Serving ☆73 · Updated 3 months ago
- LLM Serving Performance Evaluation Harness ☆78 · Updated 4 months ago
- A lightweight design for computation-communication overlap (a minimal PyTorch sketch of the general technique follows this list). ☆143 · Updated this week
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity ☆214 · Updated last year
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving ☆42 · Updated last month
- A minimal implementation of vLLM. ☆44 · Updated 10 months ago
- nnScaler: Compiling DNN Models for Parallel Training ☆113 · Updated this week
- LLM inference analyzer for different hardware platforms ☆74 · Updated 3 weeks ago
- PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation ☆30 · Updated 7 months ago
- PyTorch library for cost-effective, fast and easy serving of MoE models. ☆198 · Updated last week
- SpotServe: Serving Generative Large Language Models on Preemptible Instances ☆123 · Updated last year
- An experimentation platform for LLM inference optimisation ☆31 · Updated 9 months ago
- NEO is an LLM inference engine built to ease the GPU memory crisis via CPU offloading ☆39 · Updated last week
- [OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable ☆163 · Updated 9 months ago
- [ICLR 2025] TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention ☆39 · Updated 2 months ago
- DeeperGEMM: a heavily optimized version of DeepGEMM ☆69 · Updated last month
- A simple calculation for LLM MFU (Model FLOPs Utilization; a worked example follows this list). ☆38 · Updated 3 months ago
- Aims to implement dual-port and multi-QP solutions in the DeepEP IBRC transport ☆50 · Updated last month
- SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs ☆48 · Updated 3 months ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆90 · Updated 2 weeks ago
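
On the computation-communication overlap entry above: the general idea is to issue communication (or data movement) on a side CUDA stream so it runs concurrently with compute on the default stream. The sketch below is illustrative only and is not taken from that repository; the tensor names, sizes, and the device-to-host copy standing in for "communication" are all assumptions.

```python
import torch

# Requires a CUDA GPU. Overlap a device-to-host copy (the "communication")
# with a matmul (the "computation") using two CUDA streams.
assert torch.cuda.is_available()

compute_stream = torch.cuda.current_stream()
comm_stream = torch.cuda.Stream()  # side stream for the transfer

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
host_buf = torch.empty(a.shape, dtype=a.dtype, pin_memory=True)  # pinned for async D2H

# The side stream must wait until `a` is fully materialized on the default stream.
comm_stream.wait_stream(compute_stream)
with torch.cuda.stream(comm_stream):
    host_buf.copy_(a, non_blocking=True)  # D2H copy runs on the copy engine
    a.record_stream(comm_stream)          # tell the caching allocator `a` is in use here

c = a @ b  # matmul on the default stream overlaps with the in-flight copy

comm_stream.synchronize()  # only now is `host_buf` safe to read on the CPU
```

In distributed settings the same pattern usually appears as an async collective (e.g. `torch.distributed.all_reduce(..., async_op=True)`) whose handle is waited on after the next compute kernel is launched.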
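On the LLM MFU entry: the standard first-order formula is MFU = achieved FLOPs/s ÷ peak FLOPs/s, with decoding one token costing roughly 2 × (parameter count) FLOPs. A minimal sketch under those assumptions follows; the function name and the example throughput and hardware figures are illustrative, not taken from the repository.

```python
# First-order MFU estimate: achieved FLOPs/s divided by hardware peak FLOPs/s.
def llm_mfu(params: float, tokens_per_second: float, peak_flops: float) -> float:
    # Decoding one token costs ~2 * params FLOPs (one multiply-add per weight);
    # attention FLOPs are ignored in this first-order estimate.
    achieved_flops = 2.0 * params * tokens_per_second
    return achieved_flops / peak_flops

# Hypothetical example: a 7B model decoding at 1,500 tok/s on a GPU with a
# 312 TFLOP/s BF16 peak (A100-class).
print(f"MFU = {llm_mfu(7e9, 1_500, 312e12):.1%}")  # -> MFU = 6.7%
```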