NVIDIA / trt-llm-as-openai-windows
This reference implementation lets any existing OpenAI-integrated app run TRT-LLM inference locally on a GeForce GPU on Windows instead of in the cloud.
☆119 · Updated 11 months ago
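Because the server speaks the OpenAI API, most existing apps only need their base URL redirected from the cloud endpoint to the local one. The sketch below shows this with the official `openai` Python client; the port, path, and model name are illustrative assumptions, not values confirmed by this repository.

```python
# Minimal sketch: point an existing OpenAI-integrated app at a local
# TRT-LLM server instead of the cloud API. The base_url, api_key, and
# model name below are placeholders -- check the server's README or
# startup log for the actual values it exposes.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # hypothetical local endpoint
    api_key="not-needed-locally",         # placeholder; a local server may ignore it
)

response = client.chat.completions.create(
    model="local-model",  # hypothetical model identifier
    messages=[{"role": "user", "content": "Hello from a local GeForce GPU!"}],
)
print(response.choices[0].message.content)
```

Apps still on the pre-1.0 `openai` package can achieve the same redirection by setting `openai.api_base` (and a dummy `openai.api_key`) before making requests.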
Alternatives and similar repositories for trt-llm-as-openai-windows:
Users interested in trt-llm-as-openai-windows are comparing it to the libraries listed below.
- Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs ☆191 · Updated this week
- The NVIDIA RTX™ AI Toolkit is a suite of tools and SDKs for Windows developers to customize, optimize, and deploy AI models across RTX PC… ☆141 · Updated 3 months ago
- An innovative library for efficient LLM inference via low-bit quantization ☆351 · Updated 5 months ago
- A pipeline parallel training script for LLMs. ☆124 · Updated 3 weeks ago
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs ☆88 · Updated this week
- ☆53 · Updated 8 months ago
- ☆65 · Updated 8 months ago
- Data preparation code for Amber 7B LLM ☆85 · Updated 9 months ago
- Fast parallel LLM inference for MLX ☆163 · Updated 7 months ago
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" ☆154 · Updated 4 months ago
- ☆113 · Updated 4 months ago
- Convenient wrapper for fine-tuning and inference of Large Language Models (LLMs) with several quantization techniques (GPTQ, bitsandbytes… ☆147 · Updated last year
- This is our own implementation of 'Layer Selective Rank Reduction' ☆233 · Updated 8 months ago
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients. ☆191 · Updated 7 months ago
- Automatically quantize GGUF models ☆154 · Updated this week
- Low-Rank adapter extraction for fine-tuned transformers models ☆169 · Updated 9 months ago
- EvolKit is an innovative framework designed to automatically enhance the complexity of instructions used for fine-tuning Large Language M… ☆203 · Updated 3 months ago
- Adapted version of llama3.np (NumPy) to a CuPy implementation for the Llama 3 model. ☆36 · Updated 9 months ago
- ☆123 · Updated 6 months ago
- An implementation of Self-Extend, which expands the context window via grouped attention ☆118 · Updated last year
- ☆109 · Updated 5 months ago
- OpenAI-compatible API for the TensorRT-LLM Triton backend ☆193 · Updated 6 months ago
- 🕹️ Performance Comparison of MLOps Engines, Frameworks, and Languages on Mainstream AI Models. ☆136 · Updated 6 months ago
- Experiments with inference on Llama ☆104 · Updated 8 months ago
- Gradio-based tool to run open-source LLMs directly from Hugging Face ☆91 · Updated 7 months ago
- ☆98 · Updated 5 months ago
- 🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O… ☆288 · Updated 3 weeks ago
- ☆161 · Updated this week
- A fast batching API for serving LLMs ☆180 · Updated 9 months ago
- Easy and Efficient Quantization for Transformers ☆193 · Updated 2 weeks ago