menloresearch / cortex.tensorrt-llm
Cortex.Tensorrt-LLM is a C++ inference library that any server can load at runtime. It includes NVIDIA's TensorRT-LLM as a git submodule for GPU-accelerated inference on NVIDIA GPUs.
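The "loaded at runtime" design means the host server pulls the engine in as a shared library rather than linking it at build time. As a rough illustration of that mechanism only (not cortex.tensorrt-llm's actual API, whose entry points are not documented here), this Python sketch loads a shared library dynamically with `ctypes`, using the standard C math library as a stand-in for an engine `.so`:

```python
import ctypes
import ctypes.util

# Locate and load a shared library at runtime. libm is a hypothetical
# stand-in for an inference-engine shared object the server would load.
path = ctypes.util.find_library("m")
engine = ctypes.CDLL(path if path else "libm.so.6")

# Declare the signature of a symbol exported by the library, then call it,
# just as a server would resolve and invoke an engine's entry points.
engine.cos.restype = ctypes.c_double
engine.cos.argtypes = [ctypes.c_double]

print(engine.cos(0.0))  # prints 1.0
```

A C++ host would do the equivalent with `dlopen`/`dlsym`; the point is that the server and the inference engine stay decoupled until runtime.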
☆43 · Updated 5 months ago
Alternatives and similar repositories for cortex.tensorrt-llm:
Users interested in cortex.tensorrt-llm are comparing it to the libraries listed below.
- ☆81 · Updated 3 months ago
- Run Ollama & GGUF easily with a single command ☆49 · Updated 10 months ago
- ☆24 · Updated 2 months ago
- Tcurtsni: Reverse Instruction Chat; ever wonder what your LLM wants to ask you? ☆21 · Updated 8 months ago
- Lightweight continuous batching with OpenAI compatibility using Hugging Face Transformers, including T5 and Whisper ☆20 · Updated last week
- Gradio-based tool to run open-source LLM models directly from Hugging Face ☆91 · Updated 8 months ago
- 1.58-bit LLaMa model ☆82 · Updated 11 months ago
- Easily view and modify JSON datasets for large language models ☆71 · Updated 3 weeks ago
- ☆111 · Updated 3 months ago
- An OpenAI API-compatible LLM inference server based on ExLlamaV2 ☆25 · Updated last year
- Experimental LLM inference UX to aid in creative writing ☆113 · Updated 3 months ago
- Deploy your GGML models to Hugging Face Spaces with Docker and Gradio ☆36 · Updated last year
- ☆53 · Updated 9 months ago
- ☆66 · Updated 9 months ago
- An API for VoiceCraft ☆25 · Updated 8 months ago
- A fast batching API to serve LLMs ☆182 · Updated 10 months ago
- After my server UI improvements were successfully merged, consider this repo a playground for experimenting, tinkering, and hacking around… ☆56 · Updated 7 months ago
- Idea: https://github.com/nyxkrage/ebook-groupchat/ ☆86 · Updated 7 months ago
- Serving LLMs in the HF Transformers format via a PyFlask API ☆71 · Updated 6 months ago
- Automatically quantize GGUF models ☆163 · Updated this week
- All the world is a play, we are but actors in it. ☆47 · Updated this week
- Easy-to-use, high-performance knowledge distillation for LLMs ☆54 · Updated this week
- A Windows tool to query various LLM AIs; supports branched conversations, history, and summaries, among other features ☆29 · Updated this week
- A pipeline-parallel training script for LLMs ☆132 · Updated this week
- Accepts a Hugging Face model URL, then automatically downloads and quantizes it using bitsandbytes ☆38 · Updated last year
- A proxy that hosts multiple single-model runners such as llama.cpp and vLLM ☆12 · Updated 2 weeks ago
- ☆31 · Updated last year
- Create text chunks that end at natural stopping points without using a tokenizer ☆26 · Updated last week
- Entropix-style sampling + GUI ☆25 · Updated 4 months ago