chu-tianxiang / vllm-gptq
A high-throughput and memory-efficient inference and serving engine for LLMs
☆132 · Updated Jun 25, 2024
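Since vllm-gptq is a GPTQ-enabled fork of vLLM, a typical use would look roughly like the sketch below. This is a minimal sketch, not taken from the project's documentation: it assumes the fork keeps vLLM's standard `LLM`/`SamplingParams` Python API and accepts a GPTQ quantization flag, and the model name is only a placeholder for any GPTQ-quantized checkpoint.

```python
# Minimal sketch (assumed API): load a GPTQ-quantized checkpoint and run a prompt.
from vllm import LLM, SamplingParams

# Placeholder model; substitute any GPTQ-quantized checkpoint you have locally or on the Hub.
llm = LLM(model="TheBloke/Llama-2-7B-Chat-GPTQ", quantization="gptq")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain GPTQ quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```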
Alternatives and similar repositories for vllm-gptq
Users interested in vllm-gptq are comparing it to the repositories listed below.
- Enhanced version of the original AutoGPTQ (https://github.com/PanQiWei/AutoGPTQ). ☆10 · Updated Nov 2, 2023
- QuIP quantization ☆62 · Updated Mar 17, 2024
- Set up the environment for vLLM users ☆16 · Updated Oct 31, 2023
- Notes on deploying GLM-4-Voice on Ubuntu ☆18 · Updated Oct 31, 2024
- Advanced Coding AI Assistant that uses a Gradio interface to stream coding-related responses. ChatRAG supports local and API inference an… ☆23 · Updated May 6, 2025
- Accelerate vector generation by using an ONNX model ☆18 · Updated Jan 23, 2024
- An OpenAI API compatible LLM inference server based on ExLlamaV2. ☆25 · Updated Feb 9, 2024
- An implementation of the MSSRM method ☆11 · Updated Mar 23, 2023
- Yet another frontend for LLMs, written using .NET and WinUI 3 ☆10 · Updated Sep 14, 2025
- An sd-webui extension for utilizing DanTagGen to "upsample prompts". ☆13 · Updated Jun 13, 2024
- ☆15 · Updated May 23, 2024
- Human-in-the-loop steps in Dify workflows via a plugin ☆14 · Updated Jan 7, 2025
- Bella Openapi implements the /v1/messages endpoint that Claude Code depends on. Any LLM protocol integrated into Bella-Openapi can be used with Claude Code, supporting not only the Claude model family but also the full OpenAI lineup, Gemini, DeepSeek, … ☆16 · Updated Nov 24, 2025
- Attend - to what matters. ☆17 · Updated Feb 22, 2025
- Boosting Natural Language Generation from Instructions with Meta-Learning ☆11 · Updated Dec 20, 2022
- Intelligent LLM routing gateway / Enterprise Intelligent AI-API Distribution Gateway ☆13 · Updated Jan 24, 2025
- An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm (a usage sketch follows this list). ☆5,026 · Updated Apr 11, 2025
- ☆24 · Updated Apr 9, 2024
- ☆96 · Updated Nov 6, 2024
- Deploy ChatGLM on Modelz ☆16 · Updated Mar 20, 2023
- Toy O ☆16 · Updated Sep 21, 2024
- AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation: ☆2,313 · Updated May 11, 2025
- fast-embeddings-api ☆16 · Updated Nov 23, 2023
- A Python implementation of the Sequential Thinking MCP server using the official Model Context Protocol (MCP) Python SDK. This server fac… ☆24 · Updated Jun 1, 2025
- ☆19 · Updated Jan 19, 2026
- CFT-RAG: An Entity Tree Based Retrieval Augmented Generation Algorithm With Cuckoo Filter ☆22 · Updated May 28, 2025
- MMLU eval for RU/EN ☆15 · Updated Jul 31, 2023
- Categorize credit card transactions using a local large language model similar to GPT3 ☆15 · Updated Dec 29, 2023
- A fast inference library for running LLMs locally on modern consumer-class GPUs ☆4,445 · Updated Dec 9, 2025
- LMDeploy is a toolkit for compressing, deploying, and serving LLMs. ☆7,606 · Updated this week
- OpenAI-style API for open large language models; use LLMs just like ChatGPT! Support for LLaMA, LLaMA-2, BLOOM, Falcon, Baichuan, Qwen, X… ☆2,465 · Updated Sep 26, 2024
- A collection of top-conference papers and review code for face liveness detection (anti-spoofing) ☆13 · Updated Apr 23, 2021
- Yet Another Papers With Code ☆35 · Updated Sep 7, 2025
- Large Multimodal Model ☆15 · Updated Apr 8, 2024
- Large-scale exact string matching tool ☆17 · Updated Mar 7, 2025
- This is an LLM interface that you can use to analyze and get insight into diary entries or other documents completely offline. ☆16 · Updated Dec 31, 2023
- Measuring RAG solutions throughput and latency ☆19 · Updated Jul 23, 2024
- The one who calls upon functions - Function-Calling Language Model ☆36 · Updated Oct 2, 2023
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆140 · Updated Dec 6, 2024
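Several entries above (AutoGPTQ, AutoAWQ, and vllm-gptq itself) revolve around weight-only quantization. As a point of reference for the AutoGPTQ entry, a quantization run looks roughly like the sketch below; the base model, settings, and calibration sentence are placeholders, and the exact API can vary between AutoGPTQ versions.

```python
# Rough sketch of an AutoGPTQ-style quantization run; model id, settings, and
# calibration text are placeholders, and the API may differ across versions.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "facebook/opt-125m"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Common GPTQ settings: 4-bit weights, group size 128.
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# GPTQ needs a small calibration set; a single sentence stands in for it here.
examples = [tokenizer("GPTQ quantizes model weights layer by layer.")]
model.quantize(examples)
model.save_quantized("opt-125m-gptq-4bit")
```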