sgl-project / mini-sglang
☆1,087 · Updated this week
Alternatives and similar repositories for mini-sglang
Users interested in mini-sglang are comparing it to the libraries listed below.
- Perplexity GPU Kernels ☆539 · Updated last month
- kernels, of the mega variety ☆631 · Updated 2 months ago
- ☆610 · Updated this week
- Materials for learning SGLang ☆682 · Updated 2 weeks ago
- LLM KV cache compression made easy ☆717 · Updated this week
- A Quirky Assortment of CuTe Kernels ☆687 · Updated last week
- Puzzles for learning Triton, play it with minimal environment configuration! ☆571 · Updated 2 weeks ago
- KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA (+ more DSLs) ☆708 · Updated this week
- Train speculative decoding models effortlessly and port them smoothly to SGLang serving. ☆557 · Updated this week
- Cataloging released Triton kernels. ☆277 · Updated 3 months ago
- JAX backend for SGL ☆200 · Updated this week
- ByteCheckpoint: A Unified Checkpointing Library for LFMs ☆256 · Updated last week
- Distributed Compiler based on Triton for Parallel Systems ☆1,280 · Updated this week
- Allow torch tensor memory to be released and resumed later ☆187 · Updated 2 weeks ago
- A throughput-oriented high-performance serving framework for LLMs ☆923 · Updated last month
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆448 · Updated 6 months ago
- Checkpoint-engine is a simple middleware to update model weights in LLM inference engines ☆864 · Updated last week
- Genai-bench is a powerful benchmark tool designed for comprehensive token-level performance evaluation of large language model (LLM) serv… ☆247 · Updated last week
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆244 · Updated 7 months ago
- An early research stage expert-parallel load balancer for MoE models based on linear programming. ☆469 · Updated last month
- Zero Bubble Pipeline Parallelism ☆440 · Updated 7 months ago
- ☆262 · Updated last week
- a minimal cache manager for PagedAttention, on top of llama3. ☆127 · Updated last year
- A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate. ☆679 · Updated last week
- Ring attention implementation with flash attention ☆944 · Updated 3 months ago
- Efficient LLM Inference over Long Sequences ☆393 · Updated 5 months ago
- ☆937 · Updated last month
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆301 · Updated this week
- Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O ☆539 · Updated 3 months ago
- A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of … ☆297 · Updated 6 months ago