sasha0552 / vllm-ci
CI scripts designed to build a Pascal-compatible version of vLLM.
☆12 · Updated last year
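For context (this snippet is not part of the repository): official vLLM wheels target newer GPU architectures, which is why a separate Pascal-compatible build is needed. A minimal, illustrative sketch, assuming PyTorch with CUDA is installed, for checking whether the local GPU is a Pascal-generation card (compute capability 6.x):

```python
# Illustrative only, not taken from vllm-ci: detect a Pascal GPU with PyTorch.
import torch

def is_pascal(device_index: int = 0) -> bool:
    """Return True if the given CUDA device reports a Pascal compute capability (6.x)."""
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability(device_index)
    return major == 6  # sm_60 / sm_61 / sm_62

if __name__ == "__main__":
    if is_pascal():
        print("Pascal GPU detected; a Pascal-compatible vLLM build is required.")
    else:
        print("Non-Pascal GPU; stock vLLM wheels should work.")
```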
Alternatives and similar repositories for vllm-ci
Users interested in vllm-ci are comparing it to the libraries listed below.
- An optimized quantization and inference library for running LLMs locally on modern consumer-class GPUs ☆571 · Updated last week
- Comparison of the output quality of quantization methods, using Llama 3, transformers, GGUF, EXL2. ☆165 · Updated last year
- The official API server for Exllama. OAI compatible, lightweight, and fast. ☆1,090 · Updated this week
- Web UI for ExLlamaV2 ☆514 · Updated 9 months ago
- A fork of vLLM enabling Pascal architecture GPUs ☆30 · Updated 9 months ago
- Large-scale LLM inference engine ☆1,591 · Updated this week
- An extension for oobabooga/text-generation-webui that enables the LLM to search the web ☆268 · Updated this week
- Comparison of Language Model Inference Engines ☆235 · Updated 11 months ago
- The main repository for building Pascal-compatible versions of ML applications and libraries. ☆147 · Updated 2 months ago
- A multimodal, function calling powered LLM webui. ☆216 · Updated last year
- LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU and Intel/AMD/Apple CPU vi… ☆886 · Updated this week
- Stable Diffusion and Flux in pure C/C++ ☆22 · Updated this week
- ☆85 · Updated last week
- A fast batching API to serve LLM models ☆188 · Updated last year
- An OpenAI API compatible API for chat with image input and questions about the images. aka Multimodal. ☆265 · Updated 8 months ago
- The llama-cpp-agent framework is a tool designed for easy interaction with Large Language Models (LLMs). Allowing users to chat with LLM … ☆606 · Updated 9 months ago
- GPU Power and Performance Manager ☆61 · Updated last year
- Make abliterated models with transformers, easy and fast ☆92 · Updated 7 months ago
- A pipeline parallel training script for LLMs. ☆162 · Updated 6 months ago
- LM inference server implementation based on *.cpp. ☆290 · Updated 3 months ago
- ☆565 · Updated last year
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆131 · Updated last year
- The RunPod worker template for serving our large language model endpoints. Powered by vLLM. ☆380 · Updated this week
- Croco.Cpp is a fork of KoboldCPP inferring GGML/GGUF models on CPU/CUDA with KoboldAI's UI. It's powered partly by IK_LLama.cpp, and compati… ☆153 · Updated this week
- Your Trusty Memory-enabled AI Companion - Simple RAG chatbot optimized for local LLMs | 12 Languages Supported | OpenAI API Compatible ☆342 · Updated 8 months ago
- vLLM for AMD gfx906 GPUs, e.g. Radeon VII / MI50 / MI60 ☆327 · Updated last month
- REAP: Router-weighted Expert Activation Pruning for SMoE compression ☆112 · Updated last week
- Pure C++ implementation of several models for real-time chatting on your computer (CPU & GPU) ☆746 · Updated this week
- Efficient 3bit/4bit quantization of LLaMA models ☆19 · Updated 2 years ago
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆951 · Updated last year