Snowflake-Labs / vllm
☆15 · Updated 4 months ago
Alternatives and similar repositories for vllm
Users interested in vllm are comparing it to the libraries listed below.
- Benchmark suite for LLMs from Fireworks.ai ☆76 · Updated this week
- ArcticTraining is a framework designed to simplify and accelerate the post-training process for large language models (LLMs) ☆190 · Updated this week
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry ☆42 · Updated last year
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS … ☆61 · Updated 9 months ago
- IBM development fork of https://github.com/huggingface/text-generation-inference ☆61 · Updated 2 months ago
- A collection of reproducible inference engine benchmarks ☆32 · Updated 3 months ago
- ☆45 · Updated last year
- ArcticInference: vLLM plugin for high-throughput, low-latency inference ☆203 · Updated this week
- Google TPU optimizations for transformers models ☆117 · Updated 6 months ago
- The backend behind the LLM-Perf Leaderboard ☆10 · Updated last year
- Matrix (Multi-Agent daTa geneRation Infra and eXperimentation framework) is a versatile engine for multi-agent conversational data genera… ☆78 · Updated last week
- Cray-LM unified training and inference stack ☆22 · Updated 6 months ago
- The code for the paper "ROUTERBENCH: A Benchmark for Multi-LLM Routing System" ☆131 · Updated last year
- ☆31 · Updated 8 months ago
- vLLM adapter for a TGIS-compatible gRPC server ☆33 · Updated this week
- Easy and Efficient Quantization for Transformers ☆198 · Updated last month
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆265 · Updated 9 months ago
- Code repository for the paper "AdANNS: A Framework for Adaptive Semantic Search" ☆65 · Updated last year
- A Lossless Compression Library for AI pipelines ☆272 · Updated last month
- Training-free post-training efficient sub-quadratic-complexity attention, implemented with OpenAI Triton ☆141 · Updated this week
- ☆76 · Updated last month
- Experiments with inference on Llama ☆104 · Updated last year
- Code for the NeurIPS LLM Efficiency Challenge ☆59 · Updated last year
- Train, tune, and infer the Bamba model ☆130 · Updated 2 months ago
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆123 · Updated 8 months ago
- XTR/WARP (SIGIR '25) is an extremely fast and accurate retrieval engine based on Stanford's ColBERTv2/PLAID and Google DeepMind's XTR ☆152 · Updated 3 months ago
- A Python wrapper around HuggingFace's TGI (text-generation-inference) and TEI (text-embedding-inference) servers ☆33 · Updated 2 months ago
- 🤝 Trade any tensors over the network ☆30 · Updated last year
- This is a fork of SGLang for hip-attention integration; please refer to hip-attention for details ☆15 · Updated this week
- ☆48 · Updated 11 months ago