anyscale / llm-continuous-batching-benchmarksLinks

☆121

Alternatives and similar repositories for llm-continuous-batching-benchmarks

Users that are interested in llm-continuous-batching-benchmarks are comparing it to the libraries listed below

Sorting:

neuralmagic / AutoFP8
☆205Updated 5 months ago
RulinShao / LightSeq
Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training
☆216Updated last year
foundation-model-stack / foundation-model-stack
🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components.
☆215Updated this week
efeslab / Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆320Updated last year
fpgaminer / GPTQ-triton
GPTQ inference Triton kernel
☆310Updated 2 years ago
usyd-fsalab / fp6_llm
An efficient GPU support for LLM inference with x-bit quantization (e.g. FP6,FP5).
☆265Updated 3 months ago
stanford-futuredata / stk
☆112Updated last year
deepspeedai / DeepSpeed-Kernels
☆72Updated 6 months ago
cli99 / llm-analysis
Latency and Memory Analysis of Transformer Models for Training and Inference
☆459Updated 6 months ago
meta-pytorch / applied-ai
Applied AI experiments and examples for PyTorch
☆299Updated 2 months ago
InternLM / turbomind
☆96Updated 6 months ago
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆83Updated last year
NetEase-FuXi / EETQ
Easy and Efficient Quantization for Transformers
☆202Updated 3 months ago
AlibabaResearch / flash-llm
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
☆221Updated 2 years ago
sgl-project / genai-bench
Genai-bench is a powerful benchmark tool designed for comprehensive token-level performance evaluation of large language model (LLM) serv…
☆220Updated last week
AniZpZ / AutoSmoothQuant
An easy-to-use package for implementing SmoothQuant for LLMs
☆107Updated 6 months ago
MDK8888 / vllmini
A minimal implementation of vllm.
☆58Updated last year
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆124Updated 4 months ago
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆130Updated 10 months ago
fanshiqing / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆154Updated last week
SqueezeAILab / KVQuant
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
☆389Updated last year
mlc-ai / llm-perf-bench
☆120Updated last year
FlagOpen / FlagAttention
A collection of memory efficient attention operators implemented in the Triton language.
☆282Updated last year
nbasyl / LLM-FP4
The official implementation of the EMNLP 2023 paper LLM-FP4
☆217Updated last year
FMInference / DejaVu
☆341Updated last year
sail-sg / zero-bubble-pipeline-parallelism
Zero Bubble Pipeline Parallelism
☆432Updated 5 months ago
microsoft / sarathi-serve
A low-latency & high-throughput serving engine for LLMs
☆431Updated last week
FasterDecoding / REST
REST: Retrieval-Based Speculative Decoding, NAACL 2024
☆210Updated last month
yanring / Megatron-MoE-ModelZoo
Best practices for training DeepSeek, Mixtral, Qwen and other MoE models using Megatron Core.
☆111Updated last week
facebookresearch / LLM-QAT
Code repo for the paper "LLM-QAT Data-Free Quantization Aware Training for Large Language Models"
☆317Updated 7 months ago