XiongjieDai / GPU-Benchmarks-on-LLM-Inference
Multiple NVIDIA GPUs or Apple Silicon for Large Language Model Inference?
☆1,877 · Updated last year
Alternatives and similar repositories for GPU-Benchmarks-on-LLM-Inference
Users interested in GPU-Benchmarks-on-LLM-Inference are comparing it to the libraries listed below.
- A fast inference library for running LLMs locally on modern consumer-class GPUs ☆4,440 · Updated 2 months ago
- Large-scale LLM inference engine ☆1,647 · Updated 3 weeks ago
- MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX. ☆2,108 · Updated last week
- Reliable model swapping for any local OpenAI/Anthropic compatible server (llama.cpp, vLLM, etc.) ☆2,374 · Updated this week
- NVIDIA Linux open GPU kernel modules with P2P support ☆1,320 · Updated 8 months ago
- Calculate tokens/s & GPU memory requirements for any LLM. Supports llama.cpp/ggml/bnb/QLoRA quantization (a back-of-the-envelope memory estimate is sketched after this list). ☆1,389 · Updated last year
- The official API server for ExLlama. OAI-compatible, lightweight, and fast. ☆1,129 · Updated this week
- This repo contains the source code for RULER: What’s the Real Context Size of Your Long-Context Language Models? ☆1,445 · Updated 2 months ago
- Create Custom LLMs ☆1,806 · Updated 3 months ago
- Infinity is a high-throughput, low-latency serving engine for text embeddings, reranking models, CLIP, CLAP, and ColPali ☆2,661 · Updated this week
- llama.cpp fork with additional SOTA quants and improved performance ☆1,605 · Updated this week
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆2,705 · Updated this week
- Enforce the output format (JSON Schema, regex, etc.) of a language model (a toy token-masking sketch follows this list). ☆1,986 · Updated 5 months ago
- LLMPerf is a library for validating and benchmarking LLMs ☆1,084 · Updated last year
- Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs ☆3,718 · Updated 8 months ago
- Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs ☆843 · Updated this week
- Tools for merging pretrained large language models. ☆6,783 · Updated 2 weeks ago
- LM Studio Apple MLX engine ☆890 · Updated this week
- Optimizing inference proxy for LLMs ☆3,317 · Updated 2 weeks ago
- AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference (the basic group-quantization arithmetic is sketched after this list). ☆2,314 · Updated 9 months ago
- An optimized quantization and inference library for running LLMs locally on modern consumer-class GPUs ☆626 · Updated 2 weeks ago
- Implements harmful/harmless refusal removal using pure HF Transformers ☆1,485 · Updated 2 months ago
- Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verifi… ☆3,084 · Updated 2 weeks ago
- Distributed LLM inference. Connect home devices into a powerful cluster to accelerate LLM inference. More devices means faster inference. ☆2,822 · Updated last week
- Comparison of Language Model Inference Engines ☆239 · Updated last year
- Run llama and other large language models offline on iOS and macOS using the GGML library. ☆1,968 · Updated last week
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters ☆1,897 · Updated 2 years ago
- LLM benchmark for throughput via Ollama (local LLMs); a minimal timing sketch follows this list. ☆331 · Updated 3 weeks ago
- The main repository for building Pascal-compatible versions of ML applications and libraries. ☆169 · Updated 5 months ago
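
A few of the entries above describe calculations that are easy to sketch. First, the tokens/s & GPU-memory calculator: the core arithmetic is weight bytes plus KV-cache bytes. A minimal back-of-the-envelope sketch, assuming an illustrative 7B-class configuration (the dimensions and the 4.5 bits/weight figure are assumptions for the example, not values from that repo):

```python
# Rough VRAM estimate for LLM inference: weights + KV cache + overhead.
# All model dimensions below are illustrative (a generic 7B-class config).

def weight_memory_gb(n_params_b: float, bits_per_weight: float) -> float:
    """Memory for model weights at a given quantization width."""
    return n_params_b * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache: two tensors (K and V) per layer, per cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

if __name__ == "__main__":
    weights = weight_memory_gb(7.0, bits_per_weight=4.5)  # ~4-bit quant incl. scales
    kv = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192)
    print(f"weights ≈ {weights:.1f} GB, KV cache ≈ {kv:.1f} GB, "
          f"total ≈ {weights + kv:.1f} GB (plus runtime overhead)")
```

With these assumed numbers that comes to roughly 3.9 GB of weights and 1.1 GB of KV cache, which is why 8 GB cards handle 7B models at 4-bit but struggle at FP16.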
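Next, the output-format enforcement entry: libraries of this kind work by masking, at each decoding step, every token that would make the output unable to match the target grammar. A toy brute-force illustration of that idea over a tiny made-up vocabulary (real libraries compile the pattern to an automaton; this is not any library's actual API):

```python
# Toy constrained decoding: at each step, allow only tokens that keep the
# output a viable prefix of the target regex. Brute force, for clarity only.
import re

VOCAB = ["0", "1", "2", "yes", "no", ",", " ", "apple"]
PATTERN = re.compile(r"(yes|no),[012]")  # target format, e.g. "yes,2"

def could_still_match(prefix: str) -> bool:
    """True if some completion of `prefix` can match PATTERN (brute force)."""
    if PATTERN.fullmatch(prefix):
        return True
    return any(could_still_match(prefix + t) for t in VOCAB
               if len(prefix) + len(t) <= 6)  # bound the search for the toy case

def allowed_tokens(prefix: str) -> list[str]:
    return [t for t in VOCAB if could_still_match(prefix + t)]

print(allowed_tokens(""))      # -> ['yes', 'no']
print(allowed_tokens("yes"))   # -> [',']
print(allowed_tokens("yes,"))  # -> ['0', '1', '2']
```

Applied inside a sampler, the disallowed tokens simply get their logits set to negative infinity before the softmax.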
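The AutoAWQ entry rests on 4-bit group quantization. A NumPy sketch of plain symmetric round-to-nearest group quantization; AWQ additionally searches for activation-aware per-channel scales, which this deliberately omits:

```python
# Generic symmetric 4-bit group quantization (round-to-nearest).
# Shows only the basic quantize/dequantize arithmetic, not AWQ's scale search.
import numpy as np

def quantize_4bit(w: np.ndarray, group_size: int = 128):
    """Quantize a 1-D weight vector to int4 with one scale per group."""
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # symmetric int4: -7..7
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
q, s = quantize_4bit(w)
print(f"mean abs quantization error: {np.abs(w - dequantize_4bit(q, s)).mean():.4f}")
```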
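Finally, the Ollama throughput benchmark: measuring tokens/s reduces to one generation call and a division. A minimal sketch against Ollama's documented REST API, assuming a local server on the default port 11434 and that the model name used here (llama3, an illustrative choice) has already been pulled:

```python
# Measure generation throughput against a local Ollama server.
import requests

def tokens_per_second(model: str, prompt: str) -> float:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # Per Ollama's API docs: eval_count = generated tokens,
    # eval_duration = generation time in nanoseconds.
    return data["eval_count"] / (data["eval_duration"] / 1e9)

if __name__ == "__main__":
    print(f"{tokens_per_second('llama3', 'Explain KV caching briefly.'):.1f} tok/s")
```

Averaging several runs and discarding the first (which includes model load) gives more stable numbers.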