friendliai / friendli-model-optimizer
FMO (Friendli Model Optimizer)
☆13 · Updated 9 months ago
Alternatives and similar repositories for friendli-model-optimizer
Users interested in friendli-model-optimizer are comparing it to the libraries listed below.
- ☆48 · Updated last year
- [⛔️ DEPRECATED] Friendli: the fastest serving engine for generative AI ☆48 · Updated 4 months ago
- Welcome to PeriFlow CLI ☁︎ ☆12 · Updated 2 years ago
- FriendliAI Model Hub ☆91 · Updated 3 years ago
- A performance library for machine learning applications. ☆184 · Updated 2 years ago
- ☆103 · Updated 2 years ago
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆118 · Updated last year
- Training-free Post-training Efficient Sub-quadratic Complexity Attention. Implemented with OpenAI Triton. ☆148 · Updated 2 weeks ago
- ☆73 · Updated 5 months ago
- ☆54 · Updated 11 months ago
- ☆25 · Updated 2 years ago
- ☆27 · Updated last year
- Easy and Efficient Quantization for Transformers ☆202 · Updated 4 months ago
- ☆24 · Updated 6 years ago
- Ditto is an open-source framework that enables direct conversion of HuggingFace PreTrainedModels into TensorRT-LLM engines. ☆49 · Updated 3 months ago
- ☆15 · Updated 4 years ago
- OwLite is a low-code AI model compression toolkit. ☆50 · Updated 5 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆84 · Updated this week
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆85 · Updated last year
- Lightweight and Parallel Deep Learning Framework ☆264 · Updated 2 years ago
- PyTorch CoreSIG ☆57 · Updated 10 months ago
- ☆91 · Updated last year
- torchcomms: a modern PyTorch communications API ☆219 · Updated this week
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆432 · Updated 5 months ago
- Large Language Model Text Generation Inference on Habana Gaudi ☆34 · Updated 7 months ago
- ☆19 · Updated 11 months ago
- MIST: High-performance IoT Stream Processing ☆17 · Updated 6 years ago
- This is a fork of SGLang for hip-attention integration. Please refer to hip-attention for details. ☆18 · Updated 2 weeks ago
- llama3.cuda is a pure C/CUDA implementation of the Llama 3 model. ☆344 · Updated 6 months ago
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ☆389 · Updated last year