opendatahub-io / vllm-tgis-adapter
vLLM adapter for a TGIS-compatible gRPC server.
⭐32 · Updated this week
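The adapter exposes the TGIS gRPC interface on top of a vLLM engine, so existing TGIS clients can point at it unchanged. Below is a minimal, unverified client sketch: it assumes Python stubs (`generation_pb2`, `generation_pb2_grpc`) have been generated from TGIS's `generation.proto`, that the server listens on the conventional TGIS gRPC port 8033, and that the service and message names (`GenerationServiceStub`, `BatchedGenerationRequest`, `GenerationRequest`) match the proto version in use — treat all of these as assumptions, not the adapter's documented API, and check the repo README for the exact launch flags.

```python
# Minimal sketch of a TGIS gRPC client talking to vllm-tgis-adapter.
# Assumes stubs were generated from TGIS's generation.proto, e.g.:
#   python -m grpc_tools.protoc -I proto --python_out=. --grpc_python_out=. generation.proto
# Service/message names below follow the TGIS proto and may vary by version.
import grpc

import generation_pb2 as pb            # generated stub (assumption)
import generation_pb2_grpc as pb_grpc  # generated stub (assumption)


def generate(prompt: str, host: str = "localhost:8033") -> str:
    # 8033 is the conventional TGIS gRPC port; the adapter may use another.
    with grpc.insecure_channel(host) as channel:
        stub = pb_grpc.GenerationServiceStub(channel)
        request = pb.BatchedGenerationRequest(
            # Some deployments also require a model_id field here.
            requests=[pb.GenerationRequest(text=prompt)],
        )
        response = stub.Generate(request)
        return response.responses[0].text


if __name__ == "__main__":
    print(generate("The capital of France is"))
```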
Alternatives and similar repositories for vllm-tgis-adapter
Users interested in vllm-tgis-adapter are comparing it to the libraries listed below.
- 👷 Build compute kernels ⭐68 · Updated this week
- Benchmark suite for LLMs from Fireworks.ai ⭐76 · Updated 2 weeks ago
- Inference server benchmarking tool ⭐74 · Updated 2 months ago
- Lightweight toolkit package to train and fine-tune 1.58bit Language models ⭐80 · Updated last month
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ⭐116 · Updated 6 months ago
- Hugging Face Inference Toolkit used to serve transformers, sentence-transformers, and diffusers models. ⭐78 · Updated 2 weeks ago
- ArcticTraining is a framework designed to simplify and accelerate the post-training process for large language models (LLMs) ⭐130 · Updated this week
- vLLM performance dashboard ⭐30 · Updated last year
- Load compute kernels from the Hub ⭐191 · Updated this week
- ⭐159 · Updated this week
- A Python wrapper around HuggingFace's TGI (text-generation-inference) and TEI (text-embedding-inference) servers. ⭐33 · Updated last month
- ⭐34 · Updated last month
- Train, tune, and infer Bamba model ⭐127 · Updated 3 weeks ago
- ⭐41 · Updated 2 weeks ago
- ⭐53 · Updated last year
- ⭐35 · Updated last year
- Implementation of the paper "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention" from Google, in PyTorch ⭐55 · Updated this week
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS …] ⭐59 · Updated 8 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ⭐129 · Updated this week
- ⭐74 · Updated 7 months ago
- Efficient, Flexible, and Highly Fault-Tolerant Model Service Management Based on SGLang ⭐53 · Updated 7 months ago
- IBM development fork of https://github.com/huggingface/text-generation-inference ⭐60 · Updated last month
- Repo hosting code and materials related to speeding up LLM inference using token merging. ⭐36 · Updated last year
- Patches for Hugging Face Transformers to save memory ⭐23 · Updated 3 weeks ago
- Data preparation code for CrystalCoder 7B LLM ⭐45 · Updated last year
- Simple high-throughput inference library ⭐119 · Updated last month
- Code for KaLM-Embedding models ⭐78 · Updated 3 months ago
- Repository for CPU Kernel Generation for LLM Inference ⭐26 · Updated last year
- Google TPU optimizations for transformers models ⭐113 · Updated 5 months ago
- Benchmark for machine learning model online serving (LLM, embedding, Stable-Diffusion, Whisper) ⭐28 · Updated last year