sgl-project / awesome-sglang
Make SGLang go brrr
☆30 · Updated last week
Alternatives and similar repositories for awesome-sglang
Users interested in awesome-sglang are comparing it to the libraries listed below.
- ☆50 · Updated 4 months ago
- Accelerate LLM preference tuning via prefix sharing with a single line of code ☆43 · Updated 2 months ago
- ☆78 · Updated 5 months ago
- ☆64 · Updated 4 months ago
- DeeperGEMM: crazy optimized version ☆70 · Updated 4 months ago
- ☆95 · Updated 5 months ago
- ☆97 · Updated 4 months ago
- Quantized Attention on GPU ☆44 · Updated 10 months ago
- [NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank ☆59 · Updated 10 months ago
- DLSlime: Flexible & Efficient Heterogeneous Transfer Toolkit ☆62 · Updated this week
- Framework to reduce autotune overhead to zero for well known deployments ☆82 · Updated this week
- ☆126 · Updated 3 months ago
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆127 · Updated 9 months ago
- Utility scripts for PyTorch (e.g. a memory profiler that understands more low-level allocations such as NCCL) ☆53 · Updated last week
- PyTorch bindings for CUTLASS grouped GEMM ☆119 · Updated 3 months ago
- ☆107 · Updated last month
- JAX backend for SGL ☆60 · Updated this week
- A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters ☆48 · Updated last year
- ☆38 · Updated last month
- A simple calculation for LLM MFU ☆45 · Updated last week
- Genai-bench is a powerful benchmark tool designed for comprehensive token-level performance evaluation of large language model (LLM) serv… ☆211 · Updated 2 weeks ago
- Estimate MFU for DeepSeekV3 ☆24 · Updated 8 months ago
- Allow torch tensor memory to be released and resumed later ☆135 · Updated last week
- Odysseus: Playground of LLM Sequence Parallelism ☆77 · Updated last year
- An experimental communicating attention kernel based on DeepEP ☆34 · Updated last month
- Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA cores for the decoding stage of LLM inference ☆43 · Updated 3 months ago
- Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ☆142 · Updated 4 months ago
- LLM Serving Performance Evaluation Harness ☆79 · Updated 6 months ago
- ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance ☆116 · Updated 4 months ago
- Fast and memory-efficient exact attention ☆93 · Updated last week