triton-inference-server / fastertransformer_backend

☆413

Alternatives and similar repositories for fastertransformer_backend

Users who are interested in fastertransformer_backend compare it to the libraries listed below.

huggingface / transformers-bloom-inference
Fast Inference Solutions for BLOOM
☆564 · Updated last year
bytedance / effective_transformer
Running BERT without Padding
☆475 · Updated 3 years ago
hpcaitech / EnergonAI
Large-scale model inference.
☆629 · Updated 2 years ago
fpgaminer / GPTQ-triton
GPTQ inference Triton kernel
☆313 · Updated 2 years ago
triton-inference-server / tensorrtllm_backend
The Triton TensorRT-LLM Backend
☆905 · Updated last week
triton-inference-server / model_analyzer
Triton Model Analyzer is a CLI tool to help with better understanding of the compute and memory requirements of the Triton Inference Serv…
☆496 · Updated last week
triton-inference-server / backend
Common source, scripts and utilities for creating Triton backends.
☆357 · Updated last week
anyscale / llm-continuous-batching-benchmarks
☆121 · Updated last year
triton-inference-server / onnxruntime_backend
The Triton backend for the ONNX Runtime.
☆166 · Updated last week
neuralmagic / AutoFP8
☆205 · Updated 6 months ago
punica-ai / punica
Serving multiple LoRA finetuned LLM as one
☆1,116 · Updated last year
triton-inference-server / python_backend
Triton backend that enables pre-process, post-processing and other logic to be implemented in Python.
☆654 · Updated this week
triton-inference-server / client
Triton Python, C++ and Java client libraries, and GRPC-generated client examples for go, java and scala.
☆657 · Updated last week
Azure / MS-AMP
Microsoft Automatic Mixed Precision Library
☆626 · Updated last year
triton-inference-server / vllm_backend
☆309 · Updated last week
bigscience-workshop / Megatron-DeepSpeed
Ongoing research training transformer language models at scale, including: BERT & GPT-2
☆1,427 · Updated last year
IST-DASLab / marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
☆946 · Updated last year
huggingface / optimum-benchmark
🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O…
☆317 · Updated last month
volcengine / veGiantModel
☆219 · Updated 2 years ago
Vahe1994 / SpQR
☆548 · Updated 11 months ago
bytedance / ByteTransformer
optimized BERT transformer inference on NVIDIA GPU. https://arxiv.org/abs/2210.03052
☆479 · Updated last year
hpcaitech / PaLM-colossalai
Scalable PaLM implementation of PyTorch
☆188 · Updated 2 years ago
void-main / fastertransformer_backend
☆22 · Updated 2 years ago
void-main / FasterTransformer
Transformer related optimization, including BERT, GPT
☆59 · Updated 2 years ago
NVIDIA / NeMo-Framework-Launcher
Provides end-to-end model development pipelines for LLMs and Multimodal models that can be launched on-prem or cloud-native.
☆508 · Updated 6 months ago
vectorch-ai / ScaleLLM
A high-performance inference system for large language models, designed for production environments.
☆482 · Updated last week
NetEase-FuXi / EETQ
Easy and Efficient Quantization for Transformers
☆202 · Updated 4 months ago
tpoisonooo / llama.onnx
LLaMa/RWKV onnx models, quantization and testcase
☆367 · Updated 2 years ago
microsoft / batch-inference
Dynamic batching library for Deep Learning inference. Tutorials for LLM, GPT scenarios.
☆102 · Updated last year
bigscience-workshop / data-preparation
Code used for sourcing and cleaning the BigScience ROOTS corpus
☆316 · Updated 2 years ago