wdlctc / headinferLinks

☆58

Alternatives and similar repositories for headinfer

Users that are interested in headinfer are comparing it to the libraries listed below

Sorting:

snu-mllab / KVzip
[NeurIPS'25 Oral] Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3)
☆114Updated last week
woct0rdho / transformers-qwen3-moe-fused
Fused Qwen3 MoE layer for faster training, compatible with HF Transformers, LoRA, 4-bit quant, Unsloth
☆188Updated this week
OpenMachine-ai / transformer-tricks
A collection of tricks and tools to speed up transformer models
☆182Updated this week
hao-ai-lab / Dynasor
[NeurIPS 2025] Simple extension on vLLM to help you speed up reasoning model without training.
☆196Updated 4 months ago
hao-ai-lab / LookaheadReasoning
[NeurIPS 2025] Scaling Speculative Decoding with Lookahead Reasoning
☆44Updated 2 weeks ago
zenrran4nlp / Awesome-LLM-Inference-Serving
☆43Updated 5 months ago
Cornell-RelaxML / yaqa-quantization
☆60Updated 3 months ago
kyleliang919 / Super_Muon
☆64Updated 6 months ago
IsaacRe / vllm-kvcompress
KV cache compression for high-throughput LLM inference
☆141Updated 8 months ago
gabrielolympie / moe-pruner
A repository aimed at pruning DeepSeek V3, R1 and R1-zero to a usable size
☆69Updated last month
chu-tianxiang / QuIP-for-all
QuIP quantization
☆59Updated last year
bentoml / BentoLMDeploy
Self-host LLMs with LMDeploy and BentoML
☆21Updated 3 months ago
Cornell-RelaxML / qtip
☆152Updated 3 months ago
tiiuae / onebitllms
Lightweight toolkit package to train and fine-tune 1.58bit Language models
☆90Updated 4 months ago
IST-DASLab / QuEST
Work in progress.
☆74Updated 3 months ago
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆130Updated 10 months ago
StigLidu / DualDistill
[EMNLP 2025] The official implementation for paper "Agentic-R1: Distilled Dual-Strategy Reasoning"
☆100Updated last month
BlinkDL / modded-nanogpt-rwkv
RWKV-7: Surpassing GPT
☆97Updated 10 months ago
DeepAuto-AI / hip-attention
Training-free Post-training Efficient Sub-quadratic Complexity Attention. Implemented with OpenAI Triton.
☆148Updated this week
samchaineau / llm_slerp_generation
Repo hosting codes and materials related to speeding LLMs' inference using token merging.
☆36Updated this week
jeffreysijuntan / lloco
The official repo for "LLoCo: Learning Long Contexts Offline"
☆117Updated last year
wdlctc / mini-s
☆52Updated 11 months ago
IST-DASLab / Quartet
☆100Updated last month
ZihanWang314 / coeCheck
☆19Updated 7 months ago
rayleizhu / vllm-ra
[ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts
☆40Updated last year
TRI-ML / linear_open_lm
A repository for research on medium sized language models.
☆78Updated last year
Infini-AI-Lab / UMbreLLa
LLM Inference on consumer devices
☆124Updated 6 months ago
Zyphra / tree_attention
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters
☆130Updated 10 months ago
eqimp / hogwild_llm
Official PyTorch implementation for Hogwild! Inference: Parallel LLM Generation with a Concurrent Attention Cache
☆125Updated last month
pangu-tech / pangu-ultra
☆71Updated 4 months ago