microsoft / batch-inference
Dynamic batching library for Deep Learning inference. Tutorials for LLM and GPT scenarios.
☆81 · updated last month
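The core idea behind the library — dynamic batching — is to buffer individual inference requests and flush them as one batch once either a size cap or a latency deadline is hit. The sketch below is a minimal, hypothetical illustration of that idea; `DynamicBatcher` and its parameters are made up for this example and are not this library's API.

```python
import time
from queue import Queue, Empty


class DynamicBatcher:
    """Hypothetical sketch of dynamic batching: requests are buffered and
    released as one batch when max_batch_size is reached or max_wait_ms
    has elapsed since the first request arrived."""

    def __init__(self, max_batch_size=8, max_wait_ms=5.0):
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_ms / 1000.0
        self.queue = Queue()

    def submit(self, request):
        """Enqueue a single request (called from request-handling threads)."""
        self.queue.put(request)

    def next_batch(self):
        """Block until a batch is ready, then return it as a list."""
        batch = [self.queue.get()]  # wait for the first request
        deadline = time.monotonic() + self.max_wait_s
        while len(batch) < self.max_batch_size:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(self.queue.get(timeout=remaining))
            except Empty:
                break  # deadline hit with a partial batch
        return batch


# Usage: coalesce per-request inputs, run the model once per batch.
batcher = DynamicBatcher(max_batch_size=4)
for i in range(6):
    batcher.submit({"prompt": f"req-{i}"})
first = batcher.next_batch()  # four queued requests coalesced into one batch
```

In a real serving loop, `next_batch()` would run on a dedicated worker thread that pads or concatenates the batched inputs, calls the model once, and scatters the outputs back to the waiting callers.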
Related projects:
- Easy and Efficient Quantization for Transformers ☆172 · updated 2 months ago
- Experiments with inference on Llama ☆106 · updated 3 months ago
- A general 2–8 bit quantization toolbox with GPTQ/AWQ/HQQ and easy export to ONNX/ONNX Runtime ☆141 · updated 3 weeks ago
- OpenAI-compatible API for the TensorRT-LLM Triton backend ☆148 · updated last month
- The Triton backend for the ONNX Runtime ☆122 · updated this week
- Easy and lightning-fast training of 🤗 Transformers on the Habana Gaudi processor (HPU) ☆144 · updated this week
- GPTQ inference Triton kernel ☆273 · updated last year
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆250 · updated this week
- Comparison of Language Model Inference Engines ☆178 · updated 2 weeks ago
- 🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O… ☆231 · updated this week
- A throughput-oriented high-performance serving framework for LLMs ☆470 · updated this week
- Simple implementation of speculative sampling in NumPy for GPT-2 ☆87 · updated last year
- Benchmark suite for LLMs from Fireworks.ai ☆51 · updated this week
- The Triton backend for PyTorch TorchScript models ☆117 · updated last week
- The official implementation of the EMNLP 2023 paper LLM-FP4 ☆156 · updated 9 months ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5) ☆173 · updated 3 months ago
- Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models" ☆240 · updated 2 weeks ago
- Advanced quantization algorithm for LLMs. This is the official implementation of "Optimize Weight Rounding via Signed Gradient Descent for t…" ☆205 · updated this week
- Experiments on speculative sampling with Llama models ☆114 · updated last year
- Scalable PaLM implementation in PyTorch ☆191 · updated last year
- An easy-to-use LLM quantization and inference toolkit based on the GPTQ algorithm (weight-only quantization) ☆90 · updated this week
- Ring-attention experiments ☆89 · updated 5 months ago
- Transformer-related optimization, including BERT and GPT ☆58 · updated 11 months ago