jd-opensource / xllm
A high-performance inference engine for LLMs, optimized for diverse AI accelerators.
☆518 · Updated last week
Alternatives and similar repositories for xllm
Users interested in xllm are comparing it to the libraries listed below.
- AI Infra mainly refers to AI infrastructure: full-stack foundational technologies such as AI chips, AI compilers, and AI inference and training frameworks. ☆247 · Updated last year
- KV cache store for distributed LLM inference (the prefix-reuse idea is sketched after this list). ☆341 · Updated last month
- ☆503 · Updated last month
- ☆70 · Updated 11 months ago
- PyTorch distributed training acceleration framework ☆52 · Updated last month
- Efficient and easy multi-instance LLM serving ☆494 · Updated last month
- SGLang kernel library for NPU ☆59 · Updated 2 weeks ago
- GLake: optimizing GPU memory management and IO transmission. ☆479 · Updated 6 months ago
- ☆75 · Updated 10 months ago
- A flexible serving framework that delivers efficient and fault-tolerant LLM inference for clustered deployments. ☆56 · Updated 2 weeks ago
- Fast and memory-efficient exact attention ☆94 · Updated this week
- Accelerate inference without tears ☆333 · Updated 2 weeks ago
- Materials for learning SGLang ☆597 · Updated last week
- Venus Collective Communication Library, supported by SII and Infrawaves. ☆95 · Updated this week
- ☆91 · Updated last week
- Train speculative decoding models effortlessly and port them smoothly to SGLang serving (speculative decoding itself is sketched after this list). ☆417 · Updated this week
- RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications. ☆874 · Updated last week
- A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation. ☆111 · Updated 4 months ago
- TePDist (TEnsor Program DISTributed) is an HLO-level automatic distributed system for DL models. ☆96 · Updated 2 years ago
- High-performance Transformer implementation in C++. ☆135 · Updated 8 months ago
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆197 · Updated 3 weeks ago
- 🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA. ☆220 · Updated 2 months ago
- Perplexity GPU Kernels ☆482 · Updated 3 weeks ago
- AI Accelerator Benchmark focuses on evaluating AI Accelerators from a practical production perspective, including the ease of use and ver… ☆265 · Updated last month
- An LLM semantic caching system aiming to enhance user experience by reducing response time via cached query-result pairs (see the semantic-cache sketch after this list). ☆952 · Updated 3 months ago
- Omni_Infer is a suite of inference accelerators designed for the Ascend NPU platform, offering native support and an expanding feature se… ☆73 · Updated last week
- DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including … ☆265 · Updated 2 months ago
- A Fully Self-Hosted Solution for Full-Duplex Voice Interaction ☆254 · Updated 2 weeks ago
- Triton Documentation in Simplified Chinese / Triton 中文文档 ☆85 · Updated 5 months ago
- Disaggregated serving system for Large Language Models (LLMs). ☆700 · Updated 6 months ago
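
To ground the KV-cache-store entry above: the core idea is to key attention key/value blocks by the token prefix that produced them, so a later request sharing that prefix can skip part of prefill. This is a minimal, hypothetical sketch of the concept; `KVCacheStore`, `put`, and `longest_hit` are illustrative names, not the project's API.

```python
# Sketch: key KV blocks by a hash of the token prefix that produced them,
# so a new request can skip prefill for any prefix already computed.
import hashlib
from typing import Optional

BLOCK = 16  # tokens per KV block (illustrative)

class KVCacheStore:
    def __init__(self) -> None:
        self._blocks: dict[str, bytes] = {}  # prefix-hash -> serialized KV

    @staticmethod
    def _key(tokens: tuple[int, ...]) -> str:
        # Hash the *entire* prefix, not just the last block, so identical
        # blocks arising in different contexts never collide.
        return hashlib.sha256(repr(tokens).encode()).hexdigest()

    def put(self, prefix: tuple[int, ...], kv: bytes) -> None:
        self._blocks[self._key(prefix)] = kv

    def longest_hit(self, tokens: tuple[int, ...]) -> tuple[int, Optional[bytes]]:
        """Return (matched_length, kv) for the longest cached block-aligned prefix."""
        for end in range(len(tokens) - len(tokens) % BLOCK, 0, -BLOCK):
            kv = self._blocks.get(self._key(tokens[:end]))
            if kv is not None:
                return end, kv
        return 0, None

store = KVCacheStore()
prompt = tuple(range(48))               # pretend token ids
store.put(prompt[:32], b"<kv bytes>")   # a previous request cached 2 blocks
hit, kv = store.longest_hit(prompt)
print(f"prefill can skip the first {hit} tokens")  # -> 32
```

A production store additionally spreads these blocks across GPU, CPU, and remote memory tiers and handles eviction; the sketch only shows the lookup contract.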
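For the speculative-decoding training entry: the served artifact works by letting a small draft model propose several tokens, which the large target model verifies in a single pass, accepting the longest agreeing prefix. Below is a minimal greedy sketch with toy stand-in functions in place of real models; all names are hypothetical.

```python
# Sketch: greedy speculative decoding with toy next-token functions.
def draft_next(ctx: list[int]) -> int:   # cheap, sometimes-wrong model
    return (ctx[-1] + 1) % 10 if ctx[-1] % 4 else 0

def target_next(ctx: list[int]) -> int:  # expensive, authoritative model
    return (ctx[-1] + 1) % 10

def speculative_step(ctx: list[int], k: int = 4) -> list[int]:
    # 1) Draft k tokens autoregressively with the cheap model.
    draft, tmp = [], list(ctx)
    for _ in range(k):
        t = draft_next(tmp)
        draft.append(t)
        tmp.append(t)
    # 2) Verify: the target scores all k positions in one pass
    #    (sequentially here, since these are toy functions).
    accepted, tmp = [], list(ctx)
    for t in draft:
        expect = target_next(tmp)
        if t != expect:
            accepted.append(expect)  # take the target's token at the first mismatch
            break
        accepted.append(t)
        tmp.append(t)
    return accepted                  # 1..k tokens per expensive target pass

ctx = [3]
for _ in range(4):
    out = speculative_step(ctx)
    ctx += out
    print(f"accepted {len(out)} token(s): {out}")
```

The payoff is that each expensive target pass can emit several tokens instead of one whenever the draft agrees with the target.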
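And for the semantic-caching entry: instead of exact string matching, queries are embedded, and a new query close enough (by cosine similarity) to a cached one returns the cached answer without calling the model. A minimal sketch follows, using a toy bag-of-words embedding as a stand-in for a real embedding model; the `SemanticCache` class and threshold are illustrative assumptions, not that project's API.

```python
# Sketch: embedding-similarity lookup over cached query -> answer pairs.
import math
from collections import Counter
from typing import Optional

def embed(text: str) -> Counter:
    # Toy embedding: word counts. Real systems use a sentence-embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self.entries: list[tuple[Counter, str]] = []  # (query embedding, answer)

    def get(self, query: str) -> Optional[str]:
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]  # near-duplicate query: skip the model call
        return None

    def put(self, query: str, answer: str) -> None:
        self.entries.append((embed(query), answer))

cache = SemanticCache()
cache.put("how do I restart the server", "Run `systemctl restart app`.")
print(cache.get("how do I restart the server ?"))  # near-duplicate -> cache hit
print(cache.get("what's the GPU memory limit"))    # unrelated -> None
```

The threshold trades hit rate against the risk of returning a cached answer to a subtly different question, which is the central tuning knob in such systems.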