[ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts
☆39 Feb 29, 2024 Updated 2 years ago
Alternatives and similar repositories for vllm-ra
Users that are interested in vllm-ra are comparing it to the libraries listed below.
- ☆85 Apr 18, 2025 Updated last year
- ☆133 Nov 11, 2024 Updated last year
- Prefix-Aware Attention for LLM Decoding ☆37 Mar 31, 2026 Updated last month
- This is a fork of SGLang for hip-attention integration. Please refer to hip-attention for detail. ☆18 Mar 31, 2026 Updated last month
- ☆16 Mar 3, 2024 Updated 2 years ago
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆148 Dec 4, 2024 Updated last year
- This is the official repo for the paper "Accelerating Parallel Sampling of Diffusion Models" Tang et al. ICML 2024 https://openreview.net… ☆16 Jul 19, 2024 Updated last year
- [ICLR 2024] CLEX: Continuous Length Extrapolation for Large Language Models ☆78 Mar 12, 2024 Updated 2 years ago
- Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation ☆41 May 11, 2026 Updated last week
- ☆157 Mar 4, 2025 Updated last year
- [OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable ☆218 Sep 21, 2024 Updated last year
- ☆96 Dec 6, 2024 Updated last year
- Code of "To The Point: Correspondence-driven monocular 3D category reconstruction (TTP)" Neurips 2021 ☆11 Jan 24, 2022 Updated 4 years ago
- Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training