XiongjieDai / GPU-Benchmarks-on-LLM-Inference
Multiple NVIDIA GPUs or Apple Silicon for Large Language Model Inference?
☆1,633 · Updated last year
Alternatives and similar repositories for GPU-Benchmarks-on-LLM-Inference
Users interested in GPU-Benchmarks-on-LLM-Inference are comparing it to the libraries listed below.
- A fast inference library for running LLMs locally on modern consumer-class GPUs☆4,196 · Updated last week
- Connect home devices into a powerful cluster to accelerate LLM inference. More devices mean faster inference.☆2,074 · Updated last month
- AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.☆2,176 · Updated 3 weeks ago
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM☆1,417 · Updated this week
- LLMPerf is a library for validating and benchmarking LLMs☆922 · Updated 5 months ago
- Large-scale LLM inference engine☆1,435 · Updated last week
- This repo contains the source code for RULER: What’s the Real Context Size of Your Long-Context Language Models?☆1,117 · Updated this week
- [MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration☆3,041 · Updated 3 weeks ago
- Tools for merging pretrained large language models.☆5,774 · Updated this week
- Calculate token/s & GPU memory requirement for any LLM. Supports llama.cpp/ggml/bnb/QLoRA quantization (see the memory-estimate sketch after this list).☆1,309 · Updated 6 months ago
- Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs☆2,989 · Updated 2 weeks ago
- Production-ready LLM compression/quantization toolkit with hardware-accelerated inference support for both CPU/GPU via HF, vLLM, and SGLa…☆590 · Updated last week
- FlashInfer: Kernel Library for LLM Serving☆3,088 · Updated this week
- Run llama and other large language models offline on iOS and macOS using the GGML library.☆1,783 · Updated 2 months ago
- A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.☆2,878 · Updated last year
- ⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Pl…☆2,170 · Updated 7 months ago
- The official API server for Exllama. OAI compatible, lightweight, and fast.☆969 · Updated this week
- MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.☆1,314 · Updated this week
- Go ahead and axolotl questions☆9,506 · Updated this week
- VPTQ, a flexible and extreme low-bit quantization algorithm☆639 · Updated last month
- Enforce the output format (JSON Schema, Regex, etc.) of a language model☆1,818 · Updated 3 months ago
- Chat language model that can use tools and interpret the results☆1,555 · Updated last week
- A throughput-oriented high-performance serving framework for LLMs☆815 · Updated 3 weeks ago
- A Datacenter Scale Distributed Inference Serving Framework☆4,136 · Updated this week
- llama3.cuda is a pure C/CUDA implementation of the Llama 3 model.☆331 · Updated last month
- Python bindings for Transformer models implemented in C/C++ using the GGML library.☆1,866 · Updated last year
- ☆895 · Updated 8 months ago
- Code for the ICLR 2023 paper "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers".☆2,119 · Updated last year
- SGLang is a fast serving framework for large language models and vision language models.☆14,814 · Updated this week
- Official implementation of Half-Quadratic Quantization (HQQ)☆818 · Updated this week
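
For the memory-requirement calculator referenced above, here is a minimal back-of-the-envelope sketch of the kind of arithmetic such a tool performs, assuming the dominant costs are the quantized weights plus the KV cache. The function name, shape parameters, and the 1.2x overhead factor are illustrative assumptions, not the calculator's actual method:

```python
# Rough VRAM estimate: weight bytes + KV-cache bytes, padded by a fixed
# overhead factor for activations and buffers. All names and the overhead
# factor are illustrative assumptions, not taken from the repo above.

def estimate_vram_gib(n_params_b: float, weight_bits: int, n_layers: int,
                      n_kv_heads: int, head_dim: int, context_len: int,
                      kv_bytes: int = 2, overhead: float = 1.2) -> float:
    weights = n_params_b * 1e9 * weight_bits / 8                # quantized weights
    kv_cache = (2 * n_layers * n_kv_heads * head_dim            # K and V tensors
                * context_len * kv_bytes)
    return (weights + kv_cache) * overhead / 2**30

# Example: a Llama-3-8B-like shape (32 layers, 8 KV heads, head_dim 128)
# with 4-bit weights and an 8k context -> roughly 5-6 GiB.
print(f"{estimate_vram_gib(8, 4, 32, 8, 128, 8192):.1f} GiB")
```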
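
Several entries above (AutoAWQ, AWQ, GPTQ, HQQ, VPTQ) center on weight-only low-bit quantization. The sketch below shows only the shared storage idea, plain group-wise round-to-nearest 4-bit quantization with a per-group scale and zero point; the actual algorithms add activation-aware scaling (AWQ) or second-order error compensation (GPTQ), which are not implemented here:

```python
# Group-wise asymmetric 4-bit quantization sketch in pure Python:
# each group of weights is stored as integers q with w ~= scale * (q - zero_point).

def quantize_group(weights, bits=4):
    qmax = (1 << bits) - 1                        # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax or 1.0         # guard flat groups against /0
    zero_point = round(-w_min / scale)            # maps w_min near integer 0
    q = [max(0, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize_group(q, scale, zero_point):
    return [scale * (qi - zero_point) for qi in q]

group = [0.12, -0.40, 0.03, 0.75, -0.22, 0.50, -0.08, 0.31]
q, s, z = quantize_group(group)
restored = dequantize_group(q, s, z)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(q, f"max abs error: {max_err:.3f}")   # error bounded by ~scale/2
```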