sasha0552 / vllm-ci
CI scripts designed to build a Pascal-compatible version of vLLM.
☆12 · Updated last year
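For context (this snippet is not part of the repository): official vLLM wheels target newer GPU architectures, which is why a separate Pascal-compatible build is needed. A minimal, illustrative sketch, assuming PyTorch with CUDA is installed, for checking whether the local GPU is a Pascal-generation card (compute capability 6.x):

```python
# Illustrative only, not taken from vllm-ci: detect a Pascal GPU with PyTorch.
import torch

def is_pascal(device_index: int = 0) -> bool:
    """Return True if the given CUDA device reports a Pascal compute capability (6.x)."""
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability(device_index)
    return major == 6  # sm_60 / sm_61 / sm_62

if __name__ == "__main__":
    if is_pascal():
        print("Pascal GPU detected; a Pascal-compatible vLLM build is required.")
    else:
        print("Non-Pascal GPU; stock vLLM wheels should work.")
```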
Alternatives and similar repositories for vllm-ci
Users interested in vllm-ci are comparing it to the libraries listed below.
- An optimized quantization and inference library for running LLMs locally on modern consumer-class GPUs ☆571 · Updated last week
- Comparison of the output quality of quantization methods, using Llama 3, transformers, GGUF, EXL2. ☆165 · Updated last year
- The official API server for Exllama. OAI compatible, lightweight, and fast. ☆1,090 · Updated this week
- Web UI for ExLlamaV2 ☆514 · Updated 9 months ago
- A fork of vLLM enabling Pascal architecture GPUs ☆30 · Updated 9 months ago
- Large-scale LLM inference engine ☆1,591 · Updated this week
- An extension for oobabooga/text-generation-webui that enables the LLM to search the web ☆268 · Updated this week
- Comparison of Language Model Inference Engines ☆235 · Updated 11 months ago
- The main repository for building Pascal-compatible versions of ML applications and libraries. ☆147 · Updated 2 months ago
- A multimodal, function calling powered LLM webui. ☆216 · Updated last year
- LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU and Intel/AMD/Apple CPU vi… ☆886 · Updated this week
- Stable Diffusion and Flux in pure C/C++ ☆22 · Updated this week
- ☆85 · Updated last week
- A fast batching API to serve LLM models ☆188 · Updated last year
- An OpenAI API compatible API for chat with image input and questions about the images. aka Multimodal. ☆265 · Updated 8 months ago
- The llama-cpp-agent framework is a tool designed for easy interaction with Large Language Models (LLMs). Allowing users to chat with LLM … ☆606 · Updated 9 months ago
- GPU Power and Performance Manager ☆61 · Updated last year
- Make abliterated models with transformers, easy and fast ☆92 · Updated 7 months ago
- A pipeline parallel training script for LLMs. ☆162 · Updated 6 months ago
- LM inference server implementation based on *.cpp. ☆290 · Updated 3 months ago
- ☆565 · Updated last year
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆131 · Updated last year
- The RunPod worker template for serving our large language model endpoints. Powered by vLLM. ☆380 · Updated this week
- Croco.Cpp is a fork of KoboldCPP inferring GGML/GGUF models on CPU/CUDA with KoboldAI's UI. It's powered partly by IK_LLama.cpp, and compati… ☆153 · Updated this week
- Your Trusty Memory-enabled AI Companion - Simple RAG chatbot optimized for local LLMs | 12 Languages Supported | OpenAI API Compatible ☆342 · Updated 8 months ago
- vLLM for AMD gfx906 GPUs, e.g. Radeon VII / MI50 / MI60 ☆327 · Updated last month
- REAP: Router-weighted Expert Activation Pruning for SMoE compression ☆112 · Updated last week
- Pure C++ implementation of several models for real-time chatting on your computer (CPU & GPU) ☆746 · Updated this week
- Efficient 3bit/4bit quantization of LLaMA models ☆19 · Updated 2 years ago
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆951 · Updated last year