triton-inference-server / fastertransformer_backend
☆411 · Updated 11 months ago
Related projects
Alternatives and complementary repositories for fastertransformer_backend
- GPTQ inference Triton kernel ☆283 · Updated last year
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆611 · Updated 2 months ago
- Fast Inference Solutions for BLOOM ☆560 · Updated last month
- Running BERT without Padding ☆460 · Updated 2 years ago
- The Triton TensorRT-LLM Backend ☆703 · Updated this week
- Common source, scripts and utilities for creating Triton backends. ☆293 · Updated this week
- Serving multiple LoRA fine-tuned LLMs as one ☆979 · Updated 6 months ago
- Microsoft Automatic Mixed Precision Library ☆522 · Updated last month
- Triton Model Analyzer is a CLI tool that helps you understand the compute and memory requirements of Triton Inference Server models. ☆426 · Updated this week
- Easy and Efficient Quantization for Transformers ☆178 · Updated 3 months ago
- Universal cross-platform tokenizer bindings to HF and sentencepiece ☆273 · Updated 2 months ago
- LLaMA/RWKV ONNX models, quantization, and test cases ☆350 · Updated last year
- Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052 ☆457 · Updated 7 months ago
- A throughput-oriented high-performance serving framework for LLMs ☆629 · Updated last month
- Code used for sourcing and cleaning the BigScience ROOTS corpus ☆305 · Updated last year
- The Triton backend for the ONNX Runtime. ☆129 · Updated this week
- Triton backend that enables pre-processing, post-processing, and other logic to be implemented in Python (see the sketch after this list). ☆544 · Updated this week
- FlashInfer: Kernel Library for LLM Serving ☆1,395 · Updated this week
- Large-scale model inference. ☆630 · Updated last year
- [ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization ☆642 · Updated 2 months ago
- Ongoing research training transformer language models at scale, including: BERT & GPT-2 ☆1,335 · Updated 7 months ago
- Latency and Memory Analysis of Transformer Models for Training and Inference ☆352 · Updated 5 months ago
- Scalable PaLM implementation in PyTorch ☆192 · Updated last year
- Comparison of Language Model Inference Engines ☆189 · Updated 2 months ago
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding ☆1,143 · Updated 3 weeks ago
- Pipeline Parallelism for PyTorch ☆725 · Updated 2 months ago
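
For the Python backend entry above: a Triton Python model is a `model.py` implementing a `TritonPythonModel` class that the server loads from the model repository. The sketch below is a minimal illustration only, assuming a hypothetical model with one INT32 input `INPUT0` and one INT32 output `OUTPUT0` declared in its `config.pbtxt`; see the python_backend repository for the authoritative interface.

```python
# Minimal sketch of a Triton Python-backend model.py (assumed config:
# one INT32 input "INPUT0", one INT32 output "OUTPUT0" in config.pbtxt).
import numpy as np
# Provided by the Triton server runtime; not installable via pip.
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # `args` is a dict of strings, including the model name and
        # the model config as serialized JSON.
        self.model_name = args["model_name"]

    def execute(self, requests):
        # Triton passes a batch of requests; return one response per request.
        responses = []
        for request in requests:
            in0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            # Example "pre-processing" step: add one to every element.
            out = in0.as_numpy() + 1
            out_tensor = pb_utils.Tensor("OUTPUT0", out.astype(np.int32))
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[out_tensor])
            )
        return responses

    def finalize(self):
        # Called once when the model is unloaded; release resources here.
        pass
```

Placed at `<model_repository>/<model_name>/1/model.py` next to the model's `config.pbtxt`, this code runs inside the server process, which is what lets Python pre/post-processing sit in an ensemble without a separate service.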