wdlctc / headinfer
☆58 Updated 4 months ago
Alternatives and similar repositories for headinfer
Users interested in headinfer are comparing it to the libraries listed below.
- [NeurIPS'25 Oral] Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3) ☆114 Updated last week
- Fused Qwen3 MoE layer for faster training, compatible with HF Transformers, LoRA, 4-bit quantization, Unsloth ☆188 Updated this week
- A collection of tricks and tools to speed up transformer models ☆182 Updated this week
- [NeurIPS 2025] A simple extension to vLLM that speeds up reasoning models without training. ☆196 Updated 4 months ago
- [NeurIPS 2025] Scaling Speculative Decoding with Lookahead Reasoning ☆44 Updated 2 weeks ago
- ☆43 Updated 5 months ago
- ☆60 Updated 3 months ago
- ☆64 Updated 6 months ago
- KV cache compression for high-throughput LLM inference ☆141 Updated 8 months ago
- A repository aimed at pruning DeepSeek V3, R1, and R1-Zero to a usable size ☆69 Updated last month
- QuIP quantization ☆59 Updated last year
- Self-host LLMs with LMDeploy and BentoML ☆21 Updated 3 months ago
- ☆152 Updated 3 months ago
- Lightweight toolkit to train and fine-tune 1.58-bit language models ☆90 Updated 4 months ago
- Work in progress. ☆74 Updated 3 months ago
- [ICLR 2025] Breaking the Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆130 Updated 10 months ago
- [EMNLP 2025] The official implementation of the paper "Agentic-R1: Distilled Dual-Strategy Reasoning" ☆100 Updated last month
- RWKV-7: Surpassing GPT ☆97 Updated 10 months ago
- Training-free, post-training, efficient sub-quadratic-complexity attention, implemented with OpenAI Triton ☆148 Updated this week
- Repo hosting code and materials on speeding up LLM inference using token merging ☆36 Updated this week
- The official repo for "LLoCo: Learning Long Contexts Offline" ☆117 Updated last year
- ☆52 Updated 11 months ago
- ☆100 Updated last month
- ☆19 Updated 7 months ago
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆40 Updated last year
- A repository for research on medium-sized language models ☆78 Updated last year
- LLM inference on consumer devices ☆124 Updated 6 months ago
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU Clusters ☆130 Updated 10 months ago
- Official PyTorch implementation of Hogwild! Inference: Parallel LLM Generation with a Concurrent Attention Cache ☆125 Updated last month
- ☆71 Updated 4 months ago