sgl-project / genai-bench
Genai-bench is a powerful benchmark tool designed for comprehensive token-level performance evaluation of large language model (LLM) serving systems.
☆112 · Updated this week
Alternatives and similar repositories for genai-bench
Users who are interested in genai-bench are comparing it to the libraries listed below.
- ☆87 · Updated 3 months ago
- ☆60 · Updated 2 months ago
- A simple calculation for LLM MFU (a worked MFU sketch follows this list). ☆38 · Updated 3 months ago
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆116 · Updated 6 months ago
- Allow torch tensor memory to be released and resumed later ☆40 · Updated last week
- ☆97 · Updated 9 months ago
- Odysseus: Playground of LLM Sequence Parallelism ☆70 · Updated last year
- DeeperGEMM: crazy optimized version ☆69 · Updated last month
- A lightweight design for computation-communication overlap. ☆143 · Updated last week
- ☆77 · Updated 2 months ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆127 · Updated 5 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆79 · Updated 9 months ago
- ☆84 · Updated 3 years ago
- Stateful LLM Serving ☆73 · Updated 3 months ago
- Compare different hardware platforms via the Roofline Model for LLM inference tasks (a roofline sketch follows this list). ☆100 · Updated last year
- PyTorch bindings for CUTLASS grouped GEMM. ☆100 · Updated 3 weeks ago
- [USENIX ATC '24] Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Paral… ☆57 · Updated 10 months ago
- A Quirky Assortment of CuTe Kernels ☆117 · Updated this week
- Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training ☆210 · Updated 10 months ago
- ☆104 · Updated 7 months ago
- nnScaler: Compiling DNN models for Parallel Training ☆113 · Updated last week
- Sequence-level 1F1B schedule for LLMs. ☆17 · Updated last year
- Scalable long-context LLM decoding that leverages sparsity by treating the KV cache as a vector storage system ☆54 · Updated last week
- A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation ☆87 · Updated last month
- A Python library that transfers PyTorch tensors between CPU and NVMe ☆116 · Updated 7 months ago
- ☆62 · Updated last year
- ☆71 · Updated last month
- A minimal implementation of vllm. ☆44 · Updated 11 months ago
- ☆141 · Updated 3 months ago
- [OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable ☆164 · Updated 9 months ago
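
As a footnote to the MFU entry above: a minimal sketch of the calculation, assuming roughly 2·N FLOPs per generated token for an N-parameter decoder. The throughput and peak-FLOPs numbers are hypothetical placeholders, and this is an illustration rather than code taken from that repository.

```python
# Minimal MFU (Model FLOPs Utilization) sketch for decode-style inference.
# Assumption (not from the repository above): ~2 * n_params FLOPs per generated token.

def mfu(tokens_per_sec: float, n_params: float, peak_flops: float) -> float:
    """Achieved FLOP/s divided by the accelerator's peak FLOP/s."""
    achieved_flops = tokens_per_sec * 2 * n_params  # ~2N FLOPs per forward token
    return achieved_flops / peak_flops

# Hypothetical example: a 70B-parameter model decoding 1,200 tokens/s
# on an accelerator with 989 TFLOP/s dense BF16 peak.
print(f"MFU ≈ {mfu(1200, 70e9, 989e12):.1%}")  # ≈ 17.0%
```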
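
Likewise, the roofline-model entry above reduces to a single bound: attainable FLOP/s is the smaller of the peak compute rate and arithmetic intensity times memory bandwidth. The sketch below uses placeholder hardware figures, not numbers from that repository.

```python
# Minimal roofline sketch: attainable FLOP/s = min(peak compute, intensity * memory bandwidth).
# Hardware numbers below are hypothetical placeholders.

def roofline(peak_flops: float, mem_bw_bytes: float, flops_per_byte: float) -> float:
    """Upper bound on sustained FLOP/s for a kernel with the given arithmetic intensity."""
    return min(peak_flops, flops_per_byte * mem_bw_bytes)

peak, bw = 989e12, 3.35e12  # e.g. ~1 PFLOP/s peak compute, ~3.35 TB/s HBM bandwidth
print(roofline(peak, bw, 2))    # low-intensity decode-like kernel: memory-bound, ~6.7 TFLOP/s
print(roofline(peak, bw, 500))  # GEMM-heavy prefill-like kernel: compute-bound, capped at peak
```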