menloresearch / cortex.tensorrt-llm
Cortex.Tensorrt-LLM is a C++ inference library that can be loaded by any server at runtime. It submodules NVIDIA's TensorRT-LLM for GPU-accelerated inference on NVIDIA GPUs.
☆43 · Updated 11 months ago
Alternatives and similar repositories for cortex.tensorrt-llm
Users interested in cortex.tensorrt-llm are comparing it to the libraries listed below.
- ☆95 · Updated last week
- 1.58-bit LLaMa model · ☆82 · Updated last year
- A fast batching API to serve LLM models · ☆185 · Updated last year
- This reference can be used with any existing OpenAI-integrated apps to run with TRT-LLM inference locally on GeForce GPU on Windows inste… · ☆127 · Updated last year
- Lightweight continuous batching with OpenAI compatibility using HuggingFace Transformers, including T5 and Whisper. · ☆26 · Updated 5 months ago
- A more memory-efficient rewrite of the HF Transformers implementation of Llama for use with quantized weights. · ☆64 · Updated last year
- Tcurtsni: Reverse Instruction Chat; ever wonder what your LLM wants to ask you? · ☆22 · Updated last year
- Automatically quantize GGUF models · ☆196 · Updated this week
- An OpenAI API-compatible LLM inference server based on ExLlamaV2. · ☆25 · Updated last year
- LLM inference in C/C++ · ☆101 · Updated this week
- Easily view and modify JSON datasets for large language models · ☆81 · Updated 3 months ago
- llama.cpp fork used by GPT4All · ☆56 · Updated 6 months ago
- ☆51 · Updated last year
- Testing LLM reasoning abilities with family-relationship quizzes. · ☆63 · Updated 7 months ago
- LLaMA Server combines the power of LLaMA C++ with the beauty of Chatbot UI. · ☆129 · Updated 2 years ago
- Run Ollama & GGUF models easily with a single command · ☆52 · Updated last year
- Low-rank adapter extraction for fine-tuned Transformers models · ☆175 · Updated last year
- Something similar to Apple Intelligence? · ☆61 · Updated last year
- Unsloth Studio · ☆101 · Updated 4 months ago
- ☆67 · Updated last year
- AirLLM 70B inference with a single 4 GB GPU · ☆14 · Updated 2 months ago
- Falcon LLM ggml framework with CPU and GPU support · ☆247 · Updated last year
- Transplants vocabulary between language models, enabling the creation of draft models for speculative decoding WITHOUT retraining. · ☆42 · Updated last month
- ☆116 · Updated 8 months ago
- ☆50 · Updated 6 months ago
- CLI tool to quantize GGUF, GPTQ, AWQ, HQQ, and EXL2 models · ☆75 · Updated 8 months ago
- Serving LLMs in the HF-Transformers format via a PyFlask API · ☆71 · Updated 11 months ago
- An unsupervised model-merging algorithm for Transformers-based language models. · ☆106 · Updated last year
- Some simple scripts that I use day-to-day when working with LLMs and the Hugging Face Hub · ☆162 · Updated last year
- This is our own implementation of 'Layer Selective Rank Reduction'