XiongjieDai / GPU-Benchmarks-on-LLM-Inference
Multiple NVIDIA GPUs or Apple Silicon for Large Language Model Inference?
☆1,063 · Updated 6 months ago
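The benchmarks behind this repo are llama.cpp generation-throughput (tokens/s) measurements across NVIDIA GPUs and Apple Silicon. As a minimal sketch of what one such measurement looks like, the snippet below times generation through the llama-cpp-python bindings; the model path, prompt, and token counts are placeholder assumptions, not values taken from the repo, and the timing is end-to-end rather than pure generation speed.

```python
# Minimal tokens/s measurement sketch using llama-cpp-python
# (pip install llama-cpp-python). MODEL_PATH is a placeholder;
# point it at any local GGUF file.
import time

from llama_cpp import Llama

MODEL_PATH = "models/llama-3-8b.Q4_0.gguf"  # hypothetical path

# n_gpu_layers=-1 offloads every layer to the GPU
# (Metal on Apple Silicon, CUDA on NVIDIA builds).
llm = Llama(model_path=MODEL_PATH, n_gpu_layers=-1, n_ctx=2048, verbose=False)

prompt = "Briefly explain the tradeoff between batch size and latency."
start = time.perf_counter()
result = llm(prompt, max_tokens=128)
elapsed = time.perf_counter() - start

# completion_tokens counts only generated tokens; elapsed still includes
# prompt evaluation, so this is a rough end-to-end figure.
tokens = result["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.2f} s -> {tokens / elapsed:.1f} tok/s")
```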
Related projects
Alternatives and complementary repositories for GPU-Benchmarks-on-LLM-Inference
- Large-scale LLM inference engine ☆1,140 · Updated this week
- Tensor parallelism is all you need. Run LLMs on an AI cluster at home using any device. Distribute the workload, divide RAM usage, and in… ☆1,506 · Updated this week
- A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations ☆737 · Updated last week
- Enforce the output format (JSON Schema, regex, etc.) of a language model ☆1,553 · Updated last month
- Optimizing inference proxy for LLMs ☆1,582 · Updated this week
- This repo contains the source code for RULER: What’s the Real Context Size of Your Long-Context Language Models? ☆732 · Updated 3 weeks ago
- The llama-cpp-agent framework is a tool designed for easy interaction with Large Language Models (LLMs), allowing users to chat with LLM… ☆494 · Updated 3 months ago
- Infinity is a high-throughput, low-latency serving engine for text embeddings, reranking models, CLIP, CLAP, and ColPali ☆1,473 · Updated this week
- Calculate token/s & GPU memory requirement for any LLM. Supports llama.cpp/ggml/bnb/QLoRA quantization ☆1,152 · Updated 2 weeks ago
- An OpenAI-compatible exllamav2 API that's both lightweight and fast ☆605 · Updated this week
- Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verifi… ☆1,634 · Updated this week
- Comparison of Language Model Inference Engines ☆190 · Updated 2 months ago
- AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. ☆1,765 · Updated this week
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆691 · Updated this week
- A fast inference library for running LLMs locally on modern consumer-class GPUs ☆3,680 · Updated this week
- VPTQ, a flexible and extreme low-bit quantization algorithm ☆526 · Updated this week
- Convert Compute And Books Into Instruct-Tuning Datasets! Makes: QA, RP, Classifiers. ☆1,033 · Updated 2 weeks ago
- Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs ☆2,205 · Updated this week
- LLMPerf is a library for validating and benchmarking LLMs ☆645 · Updated 3 months ago
- An innovative library for efficient LLM inference via low-bit quantization ☆348 · Updated 2 months ago
- NVIDIA Linux open GPU kernel modules with P2P support ☆914 · Updated 5 months ago
- Official implementation of Half-Quadratic Quantization (HQQ) ☆702 · Updated this week
- A throughput-oriented high-performance serving framework for LLMs ☆637 · Updated 2 months ago
- SGLang is a fast serving framework for large language models and vision language models. ☆6,127 · Updated this week
- Chat language model that can use tools and interpret the results ☆1,429 · Updated last week
- Serving multiple LoRA fine-tuned LLMs as one ☆986 · Updated 6 months ago
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters ☆1,755 · Updated 10 months ago
- Manage GPU clusters for running LLMs ☆646 · Updated this week