punica-ai / punicaLinks

Serving multiple LoRA finetuned LLM as one

☆1,101

Alternatives and similar repositories for punica

Users that are interested in punica are comparing it to the libraries listed below

Sorting:

S-LoRA / S-LoRA
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
☆1,859Updated last year
hao-ai-lab / LookaheadDecoding
[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
☆1,288Updated 7 months ago
IST-DASLab / marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
☆916Updated last year
apoorvumang / prompt-lookup-decoding
☆572Updated last year
efeslab / Nanoflow
A throughput-oriented high-performance serving framework for LLMs
☆904Updated last month
microsoft / MInference
[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up Long-context LLMs' inference, approximate and dynamic sparse calculate the attention…
☆1,141Updated 3 weeks ago
vectorch-ai / ScaleLLM
A high-performance inference system for large language models, designed for production environments.
☆479Updated 2 weeks ago
mobiusml / hqq
Official implementation of Half-Quadratic Quantization (HQQ)
☆883Updated last month
SqueezeAILab / SqueezeLLM
[ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization
☆704Updated last year
Cornell-RelaxML / quip-sharp
☆559Updated 11 months ago
Azure / MS-AMP
Microsoft Automatic Mixed Precision Library
☆626Updated last year
Vahe1994 / SpQR
☆546Updated 10 months ago
deepspeedai / DeepSpeed-MII
MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
☆2,067Updated 3 months ago
mit-han-lab / omniserve
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆766Updated 7 months ago
IST-DASLab / sparsegpt
Code for the ICML 2023 paper "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot".
☆843Updated last year
tomaarsen / attention_sinks
Extend existing LLMs way beyond the original training length with constant memory usage, without retraining
☆722Updated last year
FasterDecoding / Medusa
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
☆2,642Updated last year
mlc-ai / xgrammar
Fast, Flexible and Portable Structured Generation
☆1,309Updated last week
lapp0 / lm-inference-engines
Comparison of Language Model Inference Engines
☆231Updated 10 months ago
haoliuhl / ringattention
Large Context Attention
☆743Updated last week
jzhang38 / EasyContext
Memory optimization and training recipes to extrapolate language models' context length to 1 million tokens, with minimal hardware.
☆747Updated last year
triton-inference-server / tensorrtllm_backend
The Triton TensorRT-LLM Backend
☆901Updated this week
jquesnelle / yarn
YaRN: Efficient Context Window Extension of Large Language Models
☆1,619Updated last year
datamllab / LongLM
[ICML'24 Spotlight] LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning
☆662Updated last year
Infini-AI-Lab / Sequoia
scalable and robust tree-based speculative decoding algorithm
☆359Updated 8 months ago
microsoft / TransformerCompression
For releasing code related to compression methods for transformers, accompanying our publications
☆446Updated 9 months ago
OpenGVLab / OmniQuant
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
☆859Updated 5 months ago
princeton-nlp / LLM-Shearing
[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
☆631Updated last year
sabetAI / BLoRA
batched loras
☆346Updated 2 years ago
fpgaminer / GPTQ-triton
GPTQ inference Triton kernel
☆310Updated 2 years ago