runpod-workers / worker-sglang
SGLang is a fast serving framework for large language models and vision language models.
☆16 · Updated 3 weeks ago
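The listing itself shows no usage. As a hedged illustration only (the port, path, and model name below are assumptions, not taken from this page): SGLang serves an OpenAI-compatible HTTP API, so a client request might be constructed like this:

```python
import json

# Hypothetical endpoint for a locally running SGLang server; the port and
# path are assumptions here — check the SGLang docs for your deployment.
SGLANG_URL = "http://localhost:30000/v1/chat/completions"

def build_chat_request(prompt: str, model: str = "default") -> str:
    """Build an OpenAI-style chat-completions payload as a JSON string."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return json.dumps(payload)

body = build_chat_request("What is SGLang?")
print(body)
# Send it with any HTTP client, e.g.:
# requests.post(SGLANG_URL, data=body,
#               headers={"Content-Type": "application/json"})
```

This is a sketch of the request shape, not the worker's actual interface; worker-sglang runs as a RunPod serverless worker, whose input format may differ.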
Alternatives and similar repositories for worker-sglang:
Users interested in worker-sglang are comparing it to the libraries listed below.
- Google TPU optimizations for transformers models ☆90 · Updated last week
- ☆52 · Updated 7 months ago
- ☆74 · Updated last year
- A toolkit for fine-tuning, inferencing, and evaluating GreenBitAI's LLMs. ☆80 · Updated last week
- An implementation of Self-Extend, to expand the context window via grouped attention ☆118 · Updated last year
- ☆154 · Updated last week
- Spherical Merge PyTorch/HF format Language Models with minimal feature loss. ☆115 · Updated last year
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆105 · Updated last month
- Benchmark suite for LLMs from Fireworks.ai ☆64 · Updated last month
- Full finetuning of large language models without large memory requirements ☆93 · Updated last year
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients. ☆190 · Updated 6 months ago
- Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs ☆185 · Updated last month
- A collection of all available inference solutions for LLMs ☆76 · Updated 4 months ago
- Repo hosting code and materials related to speeding up LLM inference using token merging. ☆34 · Updated 9 months ago
- Low-Rank adapter extraction for fine-tuned transformers models ☆167 · Updated 8 months ago
- Parameter-Efficient Sparsity Crafting From Dense to Mixture-of-Experts for Instruction Tuning on General Tasks ☆31 · Updated 8 months ago
- Train, tune, and infer the Bamba model ☆80 · Updated 2 weeks ago
- Utils for Unsloth ☆29 · Updated last week
- IBM development fork of https://github.com/huggingface/text-generation-inference ☆58 · Updated last month
- Let's create synthetic textbooks together :) ☆73 · Updated last year
- ☆110 · Updated 4 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆257 · Updated 3 months ago
- ☆74 · Updated last year
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs ☆88 · Updated this week
- The official repo for "LLoCo: Learning Long Contexts Offline" ☆114 · Updated 7 months ago
- Easy-to-use, high-performance knowledge distillation for LLMs ☆40 · Updated 2 weeks ago
- ☆65 · Updated 8 months ago
- Data preparation code for the Amber 7B LLM ☆84 · Updated 8 months ago
- Fast approximate inference on a single GPU with sparsity-aware offloading ☆38 · Updated last year
- An OpenAI Completions API-compatible server for NLP transformers models ☆63 · Updated last year