janhq / cortex.tensorrt-llm
Cortex.Tensorrt-LLM is a C++ inference library that can be loaded by any server at runtime. It submodules NVIDIA's TensorRT-LLM for GPU-accelerated inference on NVIDIA GPUs.
☆39 · Updated last month
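The "loaded by any server at runtime" design typically means the library is shipped as a shared object that a host process opens dynamically. Below is a minimal sketch of that pattern using POSIX `dlopen`/`dlsym`; the symbol name `create_engine` is a hypothetical example, not cortex.tensorrt-llm's actual API.

```cpp
#include <dlfcn.h>

// Attempt to load an inference-engine shared library at runtime and
// resolve a factory symbol. Returns true only if both steps succeed.
// The symbol name "create_engine" is hypothetical, for illustration only.
bool try_load_engine(const char* path, const char* symbol = "create_engine") {
    void* handle = dlopen(path, RTLD_LAZY);
    if (!handle) {
        return false;  // library not present or not loadable
    }
    void* factory = dlsym(handle, symbol);
    dlclose(handle);
    return factory != nullptr;  // engine exposes the expected entry point
}
```

A server could call something like this at startup and fall back to another backend (e.g. a llama.cpp engine) when the TensorRT-LLM library or a compatible GPU is unavailable.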
Related projects
Alternatives and complementary repositories for cortex.tensorrt-llm
- A fast batching API to serve LLMs ☆172 · Updated 6 months ago
- A pipeline-parallel training script for LLMs. ☆83 · Updated 3 weeks ago
- All the world is a play, we are but actors in it. ☆47 · Updated 4 months ago
- Automatically quantize GGUF models ☆137 · Updated this week
- idea: https://github.com/nyxkrage/ebook-groupchat/ ☆81 · Updated 2 months ago
- Distributed inference for MLX LLMs ☆68 · Updated 3 months ago
- ☆103 · Updated 7 months ago
- An extension that lets the AI take the wheel, allowing it to use the mouse and keyboard, recognize UI elements, and prompt itself :3...no… ☆95 · Updated 2 weeks ago
- For inferring and serving local LLMs using the MLX framework ☆89 · Updated 7 months ago
- Large Model Proxy is designed to make it easy to run multiple resource-heavy Large Models (LM) on the same machine with a limited amount of… ☆45 · Updated last month
- Real-time TTS reading of large text files by your favourite voice, plus translation via LLM (Python script) ☆47 · Updated 3 weeks ago
- A Python package for serving LLMs on OpenAI-compatible API endpoints with prompt caching using MLX. ☆51 · Updated last week
- ☆148 · Updated 3 months ago
- Local LLM inference & management server with built-in OpenAI API ☆31 · Updated 6 months ago
- ☆95 · Updated last week
- Something similar to Apple Intelligence? ☆57 · Updated 4 months ago
- After my server UI improvements were successfully merged, consider this repo a playground for experimenting, tinkering and hacking around… ☆56 · Updated 2 months ago
- MLX-Embeddings is the best package for running vision and language embedding models locally on your Mac using MLX. ☆75 · Updated 3 weeks ago
- Gradio-based tool to run open-source LLMs directly from Hugging Face ☆87 · Updated 4 months ago
- Easily view and modify JSON datasets for large language models ☆62 · Updated last month
- ☆63 · Updated last month
- Serving LLMs in the HF Transformers format via a PyFlask API ☆68 · Updated last month
- A Python application that routes incoming prompts to an LLM by category, and can support a single incoming connection from a front end to… ☆160 · Updated last week
- HTTP proxy for on-demand model loading with llama.cpp (or other OpenAI-compatible backends) ☆32 · Updated last week
- Cohere Toolkit is a collection of prebuilt components enabling users to quickly build and deploy RAG applications. ☆25 · Updated last week
- Run Ollama & GGUF models easily with a single command ☆47 · Updated 5 months ago
- Dagger functions to import Hugging Face GGUF models into a local Ollama instance and optionally push them to ollama.com. ☆110 · Updated 5 months ago
- ☆25 · Updated last month
- A simple experiment in letting two local LLMs have a conversation about anything! ☆91 · Updated 4 months ago
- Comparison of the output quality of quantization methods, using Llama 3, Transformers, GGUF, and EXL2. ☆126 · Updated 5 months ago