intel / neural-speed
An innovative library for efficient LLM inference via low-bit quantization
⭐349 · Updated last year
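For context, here is a minimal sketch of what low-bit inference with neural-speed can look like. It follows the pattern in the project's README; the checkpoint name and the `weight_dtype`/`compute_dtype` values are illustrative assumptions, not the only supported options.

```python
# Minimal sketch of INT4 weight-only inference with neural-speed.
# Assumptions: Model.init() accepts weight_dtype/compute_dtype as shown in
# the project's README; the checkpoint below is illustrative.
from transformers import AutoTokenizer
from neural_speed import Model

model_name = "Intel/neural-chat-7b-v3-1"  # illustrative HF checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
input_ids = tokenizer("Once upon a time", return_tensors="pt").input_ids

model = Model()
# Quantize weights to INT4 and run compute in INT8 (the low-bit path).
model.init(model_name, weight_dtype="int4", compute_dtype="int8")
outputs = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```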
Alternatives and similar repositories for neural-speed
Users interested in neural-speed are comparing it to the libraries listed below.
- A high-throughput and memory-efficient inference and serving engine for LLMs ⭐267 · Updated last year
- 🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O… ⭐317 · Updated last month
- Official implementation of Half-Quadratic Quantization (HQQ) ⭐891 · Updated 3 weeks ago
- Advanced quantization toolkit for LLMs. Native support for WOQ, MXFP4, NVFP4, GGUF, Adaptive Bits and seamless integration with Transform… ⭐712 · Updated this week
- Comparison of Language Model Inference Engines ⭐235 · Updated 11 months ago
- OpenAI-compatible API for the TensorRT-LLM Triton backend ⭐217 · Updated last year
- Easy and Efficient Quantization for Transformers ⭐202 · Updated 4 months ago
- A general 2–8 bit quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, and easy export to ONNX/ONNX Runtime ⭐180 · Updated 7 months ago
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs ⭐93 · Updated this week
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models" ⭐277 · Updated 2 years ago
- VPTQ: a flexible and extreme low-bit quantization algorithm ⭐666 · Updated 6 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ⭐204 · Updated last week
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" ⭐154 · Updated last year
- ⭐565 · Updated last year
- Code for the paper "QuIP: 2-Bit Quantization of Large Language Models With Guarantees" ⭐387 · Updated last year
- FP16×INT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16–32 tokens ⭐951 · Updated last year
- ⭐312 · Updated this week
- ⭐205 · Updated 6 months ago
- 🕹️ Performance Comparison of MLOps Engines, Frameworks, and Languages on Mainstream AI Models ⭐139 · Updated last year
- 🤗 Optimum Intel: Accelerate inference with Intel optimization tools ⭐507 · Updated this week
- A throughput-oriented high-performance serving framework for LLMs ⭐914 · Updated 3 weeks ago
- Scalable and robust tree-based speculative decoding algorithm ⭐361 · Updated 9 months ago
- GPTQ inference Triton kernel ⭐313 · Updated 2 years ago
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients ⭐202 · Updated last year
- A high-performance inference system for large language models, designed for production environments ⭐482 · Updated last week
- For releasing code related to compression methods for transformers, accompanying our publications ⭐449 · Updated 10 months ago
- ⭐218 · Updated 9 months ago
- ⭐120 · Updated last year
- Easy and lightning-fast training of 🤗 Transformers on the Habana Gaudi processor (HPU) ⭐201 · Updated last week
- [ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization ⭐708 · Updated last year