triton-inference-server / stateful_backend
Triton backend that automatically manages model state tensors in the sequence batcher
☆17 · Updated last year
Alternatives and similar repositories for stateful_backend
Users interested in stateful_backend often compare it to the libraries listed below.
- The Triton backend for the ONNX Runtime. ☆145 · Updated this week
- TRITONCACHE implementation of a Redis cache ☆13 · Updated this week
- 🌏 Modular retrievers for zero-shot multilingual IR. ☆27 · Updated last year
- The Triton backend for the PyTorch TorchScript models. ☆150 · Updated last week
- Triton Model Navigator is an inference toolkit designed for optimizing and deploying Deep Learning models with a focus on NVIDIA GPUs. ☆200 · Updated 3 weeks ago
- Tutorial on how to convert machine learned models into ONNX ☆16 · Updated 2 years ago
- The core library and APIs implementing the Triton Inference Server. ☆130 · Updated this week
- Triton CLI is an open source command line interface that enables users to create, deploy, and profile models served by the Triton Inferen… ☆62 · Updated 3 weeks ago
- A Ray-based data loader with per-epoch shuffling and configurable pipelining, for shuffling and loading training data for distributed tra… ☆18 · Updated 2 years ago
- The Triton backend for TensorRT. ☆75 · Updated this week
- Implementation of "Efficient Multi-vector Dense Retrieval with Bit Vectors", ECIR 2024 ☆61 · Updated 7 months ago
- Cortex-compatible model server for Python and TensorFlow ☆17 · Updated 2 years ago
- Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers, Tiktoken and more. Supports BPE, Unigram and… ☆24 · Updated last month
- XTR: Rethinking the Role of Token Retrieval in Multi-Vector Retrieval ☆51 · Updated 10 months ago
- Comparing PyTorch, JIT and ONNX for inference with Transformers ☆19 · Updated 4 years ago
- Common source, scripts and utilities shared across all Triton repositories. ☆71 · Updated this week
- Optimizing bit-level Jaccard Index and Population Counts for large-scale quantized Vector Search via Harley-Seal CSA and Lookup Tables ☆18 · Updated this week
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models ☆11 · Updated last year
- Distributed ML Optimizer ☆32 · Updated 3 years ago
- IBM development fork of https://github.com/huggingface/text-generation-inference ☆60 · Updated last week
- Parallel Computing starter project to build GPU & CPU kernels in CUDA & C++ and call them from Python without a single line of CMake usin… ☆25 · Updated 2 months ago
- 🤝 Trade any tensors over the network ☆30 · Updated last year
- Showcase how mxbai-embed-large-v1 can be used to produce binary embedding. Binary embeddings enabled 32x storage savings and 40x faster r… ☆18 · Updated last year
- Trace LLM calls (and others) and visualize them in WandB, as interactive SVG or using a streaming local webapp ☆14 · Updated 3 months ago
- Truly flash implementation of DeBERTa disentangled attention mechanism. ☆48 · Updated last week
- vLLM adapter for a TGIS-compatible gRPC server. ☆27 · Updated this week
- MLFlow Deployment Plugin for Ray Serve ☆44 · Updated 3 years ago
- Pre-train Static Word Embeddings ☆60 · Updated last month
- Benchmark for machine learning model online serving (LLM, embedding, Stable-Diffusion, Whisper) ☆28 · Updated last year
- The Triton backend for TensorFlow. ☆51 · Updated last month