SkyworkAI / vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆15 · Updated 5 months ago
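A minimal offline-inference sketch, assuming this fork keeps upstream vLLM's Python API; the model name and prompts below are placeholders:

```python
# Minimal offline inference with vLLM's Python API (assumed to match
# upstream vLLM; model name and prompts are placeholders).
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="facebook/opt-125m")  # any HF-compatible model path
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```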
Related projects
Alternatives and complementary repositories for vllm
- OneFlow Serving ☆20 · Updated 9 months ago
- Decoding Attention is specially optimized for multi-head attention (MHA) using CUDA cores for the decoding stage of LLM inference (see the PyTorch sketch after this list). ☆25 · Updated 2 weeks ago
- ☆35 · Updated 2 weeks ago
- ☆11 · Updated last year
- Transformer-related optimization, including BERT and GPT ☆17 · Updated last year
- An easy way to run, test, benchmark, and tune OpenCL kernel files ☆23 · Updated last year
- OneFlow->ONNX ☆42 · Updated last year
- Odysseus: Playground of LLM Sequence Parallelism ☆57 · Updated 5 months ago
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models ☆20 · Updated 8 months ago
- NVIDIA TensorRT Hackathon 2023 second-round topic: building and optimizing the Tongyi Qianwen Qwen-7B model with TensorRT-LLM ☆40 · Updated last year
- LLM deployment project based on ONNX ☆26 · Updated last month
- ☆13 · Updated 7 months ago
- ☢️ TensorRT 2023 second round: inference acceleration and optimization for Llama models based on TensorRT-LLM ☆44 · Updated last year
- ☆18 · Updated 10 months ago
- ☆23 · Updated last year
- Efficient, Flexible, and Highly Fault-Tolerant Model Service Management Based on SGLang ☆24 · Updated last week
- Tianchi NVIDIA TensorRT Hackathon 2023 generative AI model optimization contest: third-place solution in the preliminary round ☆47 · Updated last year
- vLLM performance dashboard ☆18 · Updated 6 months ago
- Summary of system papers/frameworks/code/tools on training or serving large models ☆56 · Updated 11 months ago
- ☆11 · Updated 9 months ago
- Whisper in TensorRT-LLM ☆14 · Updated last year
- Hands-on large model deployment: TensorRT-LLM, Triton Inference Server, vLLM ☆26 · Updated 8 months ago
- Standalone Flash Attention v2 kernel without libtorch dependency ☆98 · Updated 2 months ago
- GPTQ inference TVM kernel ☆36 · Updated 6 months ago
- ☆22 · Updated last year
- Datasets, Transforms and Models specific to Computer Vision ☆82 · Updated last year
- ☆18 · Updated 9 months ago
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆34 · Updated 8 months ago
- ☆19 · Updated last month
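For reference, the decode-stage attention that the Decoding Attention project above accelerates reduces, per step, to a single new query token attending over the cached keys and values. A minimal PyTorch sketch of that computation (illustrative only; the listed repo implements this as a fused CUDA kernel, and all shapes here are assumptions):

```python
import torch

def decode_attention(q, k_cache, v_cache):
    """Single-step decode attention for multi-head attention (MHA).

    q:       [batch, heads, 1, head_dim]   - query for the new token
    k_cache: [batch, heads, seq, head_dim] - cached keys
    v_cache: [batch, heads, seq, head_dim] - cached values
    Returns: [batch, heads, 1, head_dim]
    """
    scale = q.shape[-1] ** -0.5
    # Scores of the new token against every cached position: [batch, heads, 1, seq]
    scores = torch.matmul(q, k_cache.transpose(-2, -1)) * scale
    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, v_cache)

# Example: batch 2, 8 heads, 128 cached tokens, head_dim 64
q = torch.randn(2, 8, 1, 64)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)
print(decode_attention(q, k, v).shape)  # torch.Size([2, 8, 1, 64])
```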