jd-opensource / xllm-service
A flexible serving framework that delivers efficient and fault-tolerant LLM inference for clustered deployments.
☆86 · Updated 2 weeks ago
Alternatives and similar repositories for xllm-service
Users interested in xllm-service are comparing it to the libraries listed below.
- ☆34 · Updated last year
- FlagCX is a scalable and adaptive cross-chip communication library. ☆172 · Updated this week
- ☆152 · Updated last year
- A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation. ☆123 · Updated last month
- PyTorch distributed training acceleration framework ☆55 · Updated 5 months ago
- DLSlime: Flexible & Efficient Heterogeneous Transfer Toolkit ☆92 · Updated 2 weeks ago
- ☆130 · Updated last year
- ☆47 · Updated last year
- ☆27 · Updated last year
- Aims to implement dual-port and multi-QP solutions in the DeepEP IBRC transport ☆73 · Updated 9 months ago
- High Performance LLM Inference Operator Library ☆695 · Updated this week
- ☆96 · Updated 10 months ago
- Fast and memory-efficient exact attention ☆114 · Updated this week
- AI Accelerator Benchmark focuses on evaluating AI Accelerators from a practical production perspective, including the ease of use and ver… ☆298 · Updated 3 weeks ago
- ☆141 · Updated last year
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆226 · Updated 3 weeks ago
- High performance Transformer implementation in C++. ☆150 · Updated last year
- ☆76 · Updated last year
- Triton adapter for Ascend. Mirror of https://gitee.com/ascend/triton-ascend ☆107 · Updated this week
- Standalone Flash Attention v2 kernel without libtorch dependency ☆114 · Updated last year
- ☆73 · Updated last year
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer ☆96 · Updated 4 months ago
- ☆60 · Updated last year
- An inference framework for the llama model, implemented in CUDA C++ ☆64 · Updated last year
- A GPU-driven system framework for scalable AI applications ☆124 · Updated last year
- Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios. ☆44 · Updated 11 months ago
- TePDist (TEnsor Program DISTributed) is an HLO-level automatic distributed system for DL models. ☆99 · Updated 2 years ago
- ⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA and CuTe APIs, achieving peak performance⚡️ (see the WMMA sketch after this list). ☆148 · Updated 9 months ago
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving ☆71 · Updated 4 months ago
- FlagTree is a unified compiler supporting multiple AI chip backends for custom deep learning operations, forked from triton-lang… ☆200 · Updated this week
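For context on the HGEMM entry above: a minimal sketch of the WMMA pattern such kernels start from, in which each warp computes one 16×16 tile of C = A × B with half-precision inputs and float accumulation. This assumes column-major matrices whose dimensions are multiples of 16; the kernel name, tile constant, and launch shape below are illustrative, not taken from the listed repository.

```cuda
// Minimal WMMA HGEMM sketch (requires sm_70+). One warp owns one 16x16
// output tile of C = A * B; A is MxK, B is KxN, C is MxN, all column-major.
// Names (wmma_hgemm_sketch, WMMA_TILE) are illustrative, not from any repo above.
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

constexpr int WMMA_TILE = 16;

__global__ void wmma_hgemm_sketch(const half* A, const half* B, float* C,
                                  int M, int N, int K) {
    // Map each warp to a 16x16 tile of C: warpM indexes tile rows, warpN tile columns.
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int warpN = blockIdx.y * blockDim.y + threadIdx.y;
    if (warpM * WMMA_TILE >= M || warpN * WMMA_TILE >= N) return;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::col_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> accFrag;
    wmma::fill_fragment(accFrag, 0.0f);

    // March along K in 16-wide steps; each mma_sync issues Tensor Core
    // instructions cooperatively across the whole warp.
    for (int k = 0; k < K; k += WMMA_TILE) {
        const half* aTile = A + warpM * WMMA_TILE + k * M;      // A(row, k), lda = M
        const half* bTile = B + k + warpN * WMMA_TILE * K;      // B(k, col), ldb = K
        wmma::load_matrix_sync(aFrag, aTile, M);
        wmma::load_matrix_sync(bFrag, bTile, K);
        wmma::mma_sync(accFrag, aFrag, bFrag, accFrag);
    }

    // Store the accumulated 16x16 tile back to C (ldc = M).
    float* cTile = C + warpM * WMMA_TILE + warpN * WMMA_TILE * M;
    wmma::store_matrix_sync(cTile, accFrag, M, wmma::mem_col_major);
}
```

Launched with, say, blockDim = (128, 4), each block holds 4×4 warps covering a 64×64 tile of C. Production kernels like those in the repository above layer shared-memory staging, double buffering, and swizzled layouts on top of this skeleton to approach peak throughput.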