intel / llm-on-ray
Pretrain, finetune and serve LLMs on Intel platforms with Ray
☆124 · Updated last week
Alternatives and similar repositories for llm-on-ray:
Users interested in llm-on-ray are comparing it to the libraries listed below.
- ☆54 · Updated 6 months ago
- Efficient and easy multi-instance LLM serving ☆367 · Updated this week
- ☆117 · Updated last year
- ☆49 · Updated 4 months ago
- A simple service that integrates vLLM with Ray Serve for fast and scalable LLM serving ☆65 · Updated last year (a hedged sketch of this pattern follows the list)
- A low-latency & high-throughput serving engine for LLMs ☆337 · Updated 2 months ago
- The driver for LMCache core to run in vLLM ☆36 · Updated 2 months ago
- ☆241 · Updated this week
- LLM Serving Performance Evaluation Harness ☆75 · Updated last month (a toy benchmarking probe follows the list)
- ☆45 · Updated 9 months ago
- Compare different hardware platforms for LLM inference tasks via the Roofline model ☆93 · Updated last year (the underlying formula is worked below the list)
- Materials for learning SGLang ☆371 · Updated 3 weeks ago
- Benchmark suite for LLMs from Fireworks.ai ☆70 · Updated 2 months ago
- Perplexity GPU Kernels ☆185 · Updated last week
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆345 · Updated 2 weeks ago
- ☆185 · Updated 6 months ago
- A large-scale simulation framework for LLM inference ☆359 · Updated 4 months ago
- NVIDIA NCCL Tests for Distributed Training ☆88 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆67 · Updated this week
- Stateful LLM Serving ☆58 · Updated last month
- A throughput-oriented high-performance serving framework for LLMs ☆794 · Updated 6 months ago
- Modular and structured prompt caching for low-latency LLM inference ☆89 · Updated 5 months ago
- Efficiently tune any LLM from Hugging Face using distributed training (multiple GPUs) and DeepSpeed. Uses Ray AIR to orchestrate the … ☆56 · Updated last year
- PyTorch distributed training acceleration framework ☆47 · Updated 2 months ago
- SpotServe: Serving Generative Large Language Models on Preemptible Instances ☆113 · Updated last year
- A collection of available inference solutions for LLMs ☆84 · Updated last month
- GLake: optimizing GPU memory management and IO transmission ☆453 · Updated 2 weeks ago
- ☆66 · Updated 2 weeks ago
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs ☆87 · Updated this week (basic usage is sketched below the list)
- ☆412 · Updated this week
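The vLLM-plus-Ray-Serve entry above follows a common pattern: wrap a vLLM engine in a Ray Serve deployment and expose it over HTTP. Below is a minimal sketch of that pattern, not the listed repo's actual code; the model name and request shape are assumptions, and a production service would use vLLM's async engine rather than the blocking `generate` call.

```python
# Minimal sketch: a vLLM engine wrapped in a Ray Serve deployment.
# Assumes `pip install "ray[serve]" vllm`; the model name is illustrative.
from ray import serve
from starlette.requests import Request
from vllm import LLM, SamplingParams

@serve.deployment(ray_actor_options={"num_gpus": 1})
class VLLMService:
    def __init__(self, model: str):
        # Blocking offline engine, kept simple for the sketch; a real
        # service would use AsyncLLMEngine for continuous batching.
        self.llm = LLM(model=model)
        self.params = SamplingParams(temperature=0.7, max_tokens=128)

    async def __call__(self, request: Request) -> dict:
        prompt = (await request.json())["prompt"]
        outputs = self.llm.generate([prompt], self.params)
        return {"text": outputs[0].outputs[0].text}

app = VLLMService.bind("facebook/opt-125m")
# serve.run(app)  # then: POST {"prompt": "..."} to http://localhost:8000/
```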
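The benchmarking entries (the serving performance harness, the Fireworks.ai suite) all reduce to the same core measurement: issue requests, record latencies, derive throughput. A toy probe in that spirit, assuming an OpenAI-compatible completions endpoint at a hypothetical localhost URL:

```python
# Toy latency/throughput probe against an OpenAI-compatible endpoint.
# The URL, model name, and payload shape are assumptions for illustration.
import statistics
import time

import requests

URL = "http://localhost:8000/v1/completions"

def probe(prompt: str, n: int = 10) -> None:
    latencies = []
    for _ in range(n):
        t0 = time.perf_counter()
        r = requests.post(
            URL,
            json={"model": "my-model", "prompt": prompt, "max_tokens": 64},
            timeout=120,
        )
        r.raise_for_status()
        latencies.append(time.perf_counter() - t0)
    print(f"mean {statistics.mean(latencies):.3f}s  "
          f"max {max(latencies):.3f}s  "
          f"~{n / sum(latencies):.2f} req/s")

# probe("Explain the roofline model in one sentence.")
```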
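The Roofline comparison rests on one formula: attainable throughput is the lesser of peak compute and arithmetic intensity times memory bandwidth. A self-contained sketch with made-up accelerator numbers:

```python
# Roofline model: attainable = min(peak_flops, intensity * mem_bandwidth).
# The peak/bandwidth figures below are illustrative, not any real device.
PEAK_FLOPS = 300e12   # 300 TFLOP/s
MEM_BW = 2.0e12       # 2.0 TB/s

def attainable(intensity_flops_per_byte: float) -> float:
    return min(PEAK_FLOPS, intensity_flops_per_byte * MEM_BW)

# LLM decode re-reads the model weights for every token, so its arithmetic
# intensity is low and it sits on the memory-bound slope of the roofline.
for ai in (1, 10, 100, 1000):
    print(f"AI={ai:>4} FLOP/byte -> {attainable(ai) / 1e12:6.1f} TFLOP/s")
```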
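For the vLLM entries, the library's basic offline API is small enough to show directly; the model name is just an example:

```python
# Basic offline inference with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

for out in llm.generate(["Hello, my name is"], params):
    # Each RequestOutput carries one CompletionOutput per sampled sequence.
    print(out.outputs[0].text)
```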