menloresearch / cortex.tensorrt-llm
Cortex.TensorRT-LLM is a C++ inference library that can be loaded by any server at runtime. It submodules NVIDIA's TensorRT-LLM for GPU-accelerated inference on NVIDIA GPUs.
☆43 · Updated 9 months ago
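Because the library is loaded at runtime rather than linked at build time, a host server only needs to resolve its exported symbols after startup. Below is a minimal, illustrative C++ sketch of that dynamic-loading pattern using POSIX dlopen/dlsym; the library path (libengine.so) and the create_engine/infer/destroy_engine entry points are hypothetical placeholders, not cortex.tensorrt-llm's actual API.

```cpp
// Sketch of loading an inference engine as a dynamic library at runtime,
// the pattern cortex.tensorrt-llm is designed for. The library path and
// symbol names below are hypothetical, not the repo's real interface.
#include <dlfcn.h>
#include <cstdio>

// Hypothetical C entry points the engine library might export.
typedef void* (*create_engine_fn)(const char* model_dir);
typedef const char* (*infer_fn)(void* engine, const char* prompt);
typedef void (*destroy_engine_fn)(void* engine);

int main() {
    // Resolve the engine library at runtime instead of linking at build time.
    void* handle = dlopen("./libengine.so", RTLD_NOW | RTLD_LOCAL);
    if (!handle) {
        std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    auto create  = reinterpret_cast<create_engine_fn>(dlsym(handle, "create_engine"));
    auto infer   = reinterpret_cast<infer_fn>(dlsym(handle, "infer"));
    auto destroy = reinterpret_cast<destroy_engine_fn>(dlsym(handle, "destroy_engine"));
    if (!create || !infer || !destroy) {
        std::fprintf(stderr, "missing symbol: %s\n", dlerror());
        dlclose(handle);
        return 1;
    }

    void* engine = create("/path/to/model");          // hypothetical model dir
    std::printf("%s\n", infer(engine, "Hello, GPU")); // run one inference
    destroy(engine);
    dlclose(handle);
    return 0;
}
```

A real integration would additionally negotiate an ABI or engine version before calling into the library, so the server can reject incompatible builds cleanly.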
Alternatives and similar repositories for cortex.tensorrt-llm
Users interested in cortex.tensorrt-llm are comparing it to the libraries listed below.
- cortex.llamacpp is a high-efficiency C++ inference engine for edge computing. It is a dynamic library that can be loaded by any server at runtime. ☆42 · Updated 3 weeks ago
- llama.cpp fork used by GPT4All ☆55 · Updated 4 months ago
- LLM inference in C/C++ ☆21 · Updated 3 months ago
- ☆95 · Updated 6 months ago
- Gradio-based tool to run open-source LLM models directly from Hugging Face ☆93 · Updated 11 months ago
- Tcurtsni: Reverse Instruction Chat. Ever wonder what your LLM wants to ask you? ☆22 · Updated last year
- ☆114 · Updated 6 months ago
- ☆66 · Updated last year
- AirLLM: 70B inference with a single 4GB GPU ☆13 · Updated last week
- After my server UI improvements were successfully merged, consider this repo a playground for experimenting, tinkering, and hacking around… ☆54 · Updated 10 months ago
- Automatically quantize GGUF models ☆184 · Updated last week
- ☆53 · Updated last year
- vLLM: a high-throughput and memory-efficient inference and serving engine for LLMs ☆87 · Updated this week
- ☆24 · Updated 5 months ago
- Local LLM inference & management server with a built-in OpenAI-compatible API ☆31 · Updated last year
- Run Ollama & GGUF models easily with a single command ☆51 · Updated last year
- Testing LLM reasoning abilities with family relationship quizzes ☆62 · Updated 4 months ago
- B-Llama3o: a LLaMA 3 with vision and audio understanding, plus text, audio, and animation data output ☆26 · Updated last year
- Serving LLMs in the HF-Transformers format via a PyFlask API ☆71 · Updated 9 months ago
- A Windows tool to query various LLM AIs. Supports branched conversations, history, and summaries, among other features. ☆31 · Updated this week
- Transplants vocabulary between language models, enabling the creation of draft models for speculative decoding WITHOUT retraining ☆31 · Updated 2 months ago
- ☆124 · Updated 2 months ago
- Cohere Toolkit is a collection of prebuilt components enabling users to quickly build and deploy RAG applications. ☆28 · Updated 5 months ago
- LLM inference in C/C++ ☆77 · Updated this week
- ☆157 · Updated 11 months ago
- A proxy that hosts multiple single-model runners such as llama.cpp and vLLM ☆11 · Updated 3 weeks ago
- Lightweight continuous-batching OpenAI compatibility layer using HuggingFace Transformers, including T5 and Whisper ☆24 · Updated 3 months ago
- This extension enhances the capabilities of textgen-webui by integrating advanced vision models, allowing users to have contextualized co… ☆54 · Updated 8 months ago
- This reference can be used with any existing OpenAI-integrated apps to run TRT-LLM inference locally on a GeForce GPU on Windows inste… ☆122 · Updated last year
- GPT-4 Level Conversational QA Trained In a Few Hours ☆62 · Updated 10 months ago