TorchMoE / MoE-Infinity

PyTorch library for cost-effective, fast and easy serving of MoE models.

☆132

Alternatives and similar repositories for MoE-Infinity:

Users who are interested in MoE-Infinity are comparing it to the libraries listed below.

opengear-project / GEAR
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
☆157 Updated 7 months ago
AlibabaResearch / flash-llm
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
☆199 Updated last year
mit-han-lab / Quest
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
☆245 Updated 2 months ago
efeslab / Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆295 Updated 7 months ago
usyd-fsalab / fp6_llm
Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5).
☆234 Updated 3 months ago
interestingLSY / swiftLLM
A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of …
☆138 Updated 7 months ago
microsoft / vattention
Dynamic Memory Management for Serving LLMs without PagedAttention
☆290 Updated this week
LoongServe / LoongServe
☆83 Updated 3 months ago
microsoft / ParrotServe
[OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable
☆143 Updated 4 months ago
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆107 Updated 2 months ago
shadowpa0327 / Palu
Code for Palu: Compressing KV-Cache with Low-Rank Projection
☆75 Updated 2 weeks ago
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆64 Updated 5 months ago
LLMServe / SwiftTransformer
High performance Transformer implementation in C++.
☆102 Updated last month
RulinShao / LightSeq
Official repository for LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers
☆205 Updated 6 months ago
andy-yang-1 / DoubleSparse
16-fold memory access reduction with nearly no loss
☆76 Updated 3 months ago
d-matrix-ai / keyformer-llm
☆52 Updated 10 months ago
YaoJiayi / CacheBlend
☆77 Updated last month
zhengzangw / Sequence-Scheduling
PyTorch implementation of paper "Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline".
☆81 Updated last year
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆64 Updated 3 months ago
FasterDecoding / SnapKV
☆220 Updated 9 months ago
thu-pacman / FasterMoE
☆76 Updated 2 years ago
microsoft / nnscaler
nnScaler: Compiling DNN models for Parallel Training
☆93 Updated last week
zhuohan123 / terapipe
☆72 Updated 3 years ago
microsoft / sarathi-serve
A low-latency & high-throughput serving engine for LLMs
☆312 Updated 3 weeks ago
madsys-dev / deepseekv2-profile
☆101 Updated 6 months ago
galeselee / Awesome_LLM_System-PaperList
Since the emergence of ChatGPT in 2022, the acceleration of Large Language Models has become increasingly important. Here is a list of pap…
☆220 Updated last month
project-etalon / etalon
LLM Serving Performance Evaluation Harness
☆68 Updated this week
stanford-futuredata / stk
☆100 Updated 5 months ago
hao-ai-lab / MuxServe
☆50 Updated 8 months ago