NVIDIA / trt-llm-as-openai-windows
This reference implementation lets existing OpenAI-integrated apps run inference locally with TensorRT-LLM on a GeForce GPU on Windows instead of in the cloud.
☆116, updated 8 months ago
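
Because the project exposes an OpenAI-compatible endpoint, existing apps can be pointed at it by overriding the client's base URL. Below is a minimal sketch using the official `openai` Python client; the local URL, API key placeholder, and model name are assumptions for illustration, so check the repository's README for the values the server actually uses.

```python
# Minimal sketch: point the official OpenAI Python client (openai>=1.0)
# at a locally hosted, OpenAI-compatible TRT-LLM endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local endpoint; see the README
    api_key="not-needed",                 # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="local-model",  # placeholder; use the model name the server exposes
    messages=[{"role": "user", "content": "Summarize TensorRT-LLM in one sentence."}],
)
print(response.choices[0].message.content)
```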
Related projects
Alternatives and complementary repositories for trt-llm-as-openai-windows
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" (☆154, updated last month)
- A pipeline-parallel training script for LLMs (☆83, updated this week)
- Automatically quantize GGUF models (☆140, updated this week)
- An independent implementation of 'Layer Selective Rank Reduction' (☆232, updated 5 months ago)
- vLLM: a high-throughput and memory-efficient inference and serving engine for LLMs (☆89, updated this week)
- The NVIDIA RTX™ AI Toolkit is a suite of tools and SDKs for Windows developers to customize, optimize, and deploy AI models across RTX PC… (☆115, updated this week)
- Low-rank adapter extraction for fine-tuned transformers models (☆162, updated 6 months ago)
- A Gradio-based tool to run open-source LLMs directly from Hugging Face (☆87, updated 4 months ago)
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients (☆173, updated 4 months ago)
- A compact LLM pretrained in 9 days on high-quality data (☆264, updated 2 months ago)
- 🕹️ Performance comparison of MLOps engines, frameworks, and languages on mainstream AI models (☆134, updated 3 months ago)
- A fast batching API for serving LLMs (☆172, updated 6 months ago)
- Simple day-to-day scripts for working with LLMs and the Hugging Face Hub (☆155, updated last year)
- An OpenAI-compatible API for the TensorRT-LLM Triton backend (☆177, updated 3 months ago)
- A high-throughput and memory-efficient inference and serving engine for LLMs (☆253, updated last month)
- A comparison of the output quality of quantization methods, using Llama 3, transformers, GGUF, and EXL2 (☆126, updated 6 months ago)
- An innovative library for efficient LLM inference via low-bit quantization (☆348, updated 2 months ago)
- Evaluate and enhance your LLM deployments for real-world inference needs (☆167, updated 2 weeks ago)
- An open-source toolkit for LLM distillation (☆358, updated 2 months ago)
- A production-ready LLM compression/quantization toolkit with accelerated inference support for both CPU and GPU via HF, vLLM, and SGLang (☆126, updated this week)
- An easy-to-understand framework for LLM samplers that rewind and revise generated tokens (☆113, updated 3 weeks ago)
- An advanced quantization algorithm for LLMs; the official implementation of "Optimize Weight Rounding via Signed Gradient Descent for t…" (☆248, updated this week)
- An unsupervised model-merging algorithm for Transformers-based language models (☆100, updated 6 months ago)
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024 (☆229, updated 3 weeks ago)