kozistr / triton-grpc-proxy-rs
Proxy server, written in Rust, for a Triton gRPC server that runs inference on an embedding model
☆21 Updated last year
Alternatives and similar repositories for triton-grpc-proxy-rs
Users who are interested in triton-grpc-proxy-rs are comparing it to the libraries listed below
- utilities for loading and running text embeddings with onnx ☆44 Updated 3 months ago
- Tiny inference-only implementation of LLaMA ☆92 Updated last year
- 🕹️ Performance Comparison of MLOps Engines, Frameworks, and Languages on Mainstream AI Models. ☆139 Updated last year
- ☆45 Updated 2 years ago
- A high performance batching router optimises max throughput for text inference workload ☆16 Updated 2 years ago
- Unofficial python bindings for the rust llm library. 🐍❤️🦀 ☆76 Updated 2 years ago
- A stable, fast and easy-to-use inference library with a focus on a sync-to-async API ☆45 Updated last year
- inference code for mixtral-8x7b-32kseqlen ☆104 Updated last year
- Optimizing bit-level Jaccard Index and Population Counts for large-scale quantized Vector Search via Harley-Seal CSA and Lookup Tables ☆21 Updated 6 months ago
- Vector Database with support for late interaction and token level embeddings. ☆54 Updated 5 months ago
- Chat Markup Language conversation library ☆55 Updated last year
- ☆135 Updated last year
- Full finetuning of large language models without large memory requirements ☆94 Updated 2 months ago
- Simplex Random Feature attention, in PyTorch ☆75 Updated 2 years ago
- ⚡️ A fast and flexible PyTorch inference server that runs locally, on any cloud or AI HW. ☆146 Updated last year
- ☆157 Updated 2 years ago
- Using modal.com to process FineWeb-edu data ☆20 Updated 8 months ago
- Let's create synthetic textbooks together :) ☆75 Updated last year
- ☆198 Updated last year
- GPU accelerated client-side embeddings for vector search, RAG etc. ☆65 Updated 2 years ago
- Training Models Daily ☆17 Updated last year
- A miniature version of Modal ☆21 Updated last year
- XTR/WARP (SIGIR'25) is an extremely fast and accurate retrieval engine based on Stanford's ColBERTv2/PLAID and Google DeepMind's XTR. ☆173 Updated 7 months ago
- an implementation of Self-Extend, to expand the context window via grouped attention ☆119 Updated last year
- ☆24 Updated 10 months ago
- Optimizing Causal LMs through GRPO with weighted reward functions and automated hyperparameter tuning using Optuna ☆59 Updated last month
- ☆140 Updated last year
- The DPAB-α Benchmark ☆32 Updated 10 months ago
- Your buddy in the (L)LM space. ☆64 Updated last year
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆53 Updated last year