jd-opensource / xllm-service
A flexible serving framework that delivers efficient and fault-tolerant LLM inference for clustered deployments.
☆86 · Updated 2 weeks ago
Alternatives and similar repositories for xllm-service
Users interested in xllm-service are comparing it to the libraries listed below.
- FlagCX is a scalable and adaptive cross-chip communication library. ☆172 · Updated this week
- ☆34 · Updated last year
- A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation. ☆123 · Updated last month
- A GPU-driven system framework for scalable AI applications ☆124 · Updated last year
- DLSlime: Flexible & Efficient Heterogeneous Transfer Toolkit ☆92 · Updated 2 weeks ago
- Aims to implement dual-port and multi-QP solutions in the DeepEP IBRC transport ☆73 · Updated 9 months ago
- PyTorch distributed training acceleration framework ☆55 · Updated 5 months ago
- ☆152 · Updated last year
- ☆96 · Updated 10 months ago
- Fast and memory-efficient exact attention ☆114 · Updated this week
- AI Accelerator Benchmark focuses on evaluating AI Accelerators from a practical production perspective, including the ease of use and ver… ☆298 · Updated 3 weeks ago
- High Performance LLM Inference Operator Library ☆695 · Updated last week
- ☆47 · Updated last year
- Compare different hardware platforms via the Roofline Model for LLM inference tasks. ☆120 · Updated last year
- High performance Transformer implementation in C++. ☆151 · Updated last year
- ☆27 · Updated last year
- ☆130 · Updated last year
- ☆71 · Updated 10 months ago
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆226 · Updated 3 weeks ago
- An unofficial CUDA assembler, for all generations of SASS, hopefully :) ☆84 · Updated 2 years ago
- Triton adapter for Ascend. Mirror of https://gitee.com/ascend/triton-ascend ☆107 · Updated this week
- Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios. ☆44 · Updated 11 months ago
- ☆155 · Updated 11 months ago
- Standalone Flash Attention v2 kernel without libtorch dependency ☆114 · Updated last year
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving ☆71 · Updated 4 months ago
- Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA cores for the decoding stage of LLM inference. ☆46 · Updated 8 months ago
- ☆26 · Updated 11 months ago
- ☆141 · Updated last year
- Compiler Infrastructure for Neural Networks ☆147 · Updated 2 years ago
- ☆23 · Updated this week