intel / neural-speed

An innovative library for efficient LLM inference via low-bit quantization

☆349

Alternatives and similar repositories for neural-speed

Users who are interested in neural-speed are comparing it to the libraries listed below.

neuralmagic / nm-vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆266 Updated 9 months ago
intel / auto-round
Advanced Quantization Algorithm for LLMs and VLMs, with support for CPU, Intel GPU, CUDA and HPU. Seamlessly integrated with Torchao, Tra…
☆564 Updated last week
mobiusml / hqq
Official implementation of Half-Quadratic Quantization (HQQ)
☆856 Updated this week
huggingface / optimum-benchmark
🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O…
☆307 Updated 2 months ago
EmbeddedLLM / vllm
vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs
☆87 Updated last week
microsoft / VPTQ
VPTQ, A Flexible and Extreme low-bit quantization algorithm
☆648 Updated 3 months ago
NetEase-FuXi / EETQ
Easy and Efficient Quantization for Transformers
☆198 Updated last month
Cornell-RelaxML / quip-sharp
☆549 Updated 9 months ago
huggingface / optimum-intel
🤗 Optimum Intel: Accelerate inference with Intel optimization tools
☆481 Updated this week
IST-DASLab / qmoe
Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models".
☆277 Updated last year
lapp0 / lm-inference-engines
Comparison of Language Model Inference Engines
☆225 Updated 7 months ago
Cornell-RelaxML / QuIP
Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees"
☆376 Updated last year
wejoncy / QLLM
A general 2-8 bits quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, and export to onnx/onnx-runtime easily.
☆175 Updated 4 months ago
microsoft / TransformerCompression
For releasing code related to compression methods for transformers, accompanying our publications
☆437 Updated 6 months ago
mlc-ai / llm-perf-bench
☆120 Updated last year
astramind-ai / BitMat
An efficient implementation of the method proposed in "The Era of 1-bit LLMs"
☆154 Updated 9 months ago
neuralmagic / compressed-tensors
A safetensors extension to efficiently store sparse quantized tensors on disk
☆142 Updated this week
npuichigo / openai_trtllm
OpenAI compatible API for TensorRT LLM triton backend
☆209 Updated last year
IST-DASLab / marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
☆870 Updated 11 months ago
Infini-AI-Lab / Sequoia
scalable and robust tree-based speculative decoding algorithm
☆354 Updated 6 months ago
neuralmagic / AutoFP8
☆195 Updated 3 months ago
apple / ml-recurrent-drafter
☆215 Updated 6 months ago
huggingface / inference-benchmarker
Inference server benchmarking tool
☆87 Updated 3 months ago
fpgaminer / GPTQ-triton
GPTQ inference Triton kernel
☆303 Updated 2 years ago
vllm-project / guidellm
Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs
☆461 Updated last week
premAI-io / benchmarks
🕹️ Performance Comparison of MLOps Engines, Frameworks, and Languages on Mainstream AI Models.
☆137 Updated last year
triton-inference-server / vllm_backend
☆286 Updated this week
SqueezeAILab / SqueezeLLM
[ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization
☆698 Updated 11 months ago
huggingface / optimum-habana
Easy and lightning fast training of 🤗 Transformers on Habana Gaudi processor (HPU)
☆191 Updated this week
VITA-Group / Q-GaLore
Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients.
☆198 Updated last year