microsoft / glinthawk
An LLM inference engine, written in C++
☆18 · Updated 2 weeks ago
Alternatives and similar repositories for glinthawk
Users interested in glinthawk are comparing it to the libraries listed below.
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆141 · Updated last year
- ☆48 · Updated last year
- Accepted to MLSys 2026 ☆70 · Updated last week
- LLM Serving Performance Evaluation Harness ☆83 · Updated 11 months ago
- [ICLR 2025] TidalDecode: A Fast and Accurate LLM Decoding with Position Persistent Sparse Attention ☆53 · Updated 6 months ago
- Accelerate LLM preference tuning via prefix sharing with a single line of code ☆51 · Updated 7 months ago
- ☆36 · Updated 11 months ago
- PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation ☆32 · Updated last year
- KV cache compression for high-throughput LLM inference ☆151 · Updated last year
- Efficient Long-context Language Model Training by Core Attention Disaggregation ☆87 · Updated last week
- [OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable ☆209 · Updated last year
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆176 · Updated last year
- Accelerate Video Diffusion Inference via Sketching-Rendering Cooperation ☆19 · Updated 7 months ago
- Easy, Fast, and Scalable Multimodal AI ☆109 · Updated this week
- 16-fold memory access reduction with nearly no loss ☆110 · Updated 10 months ago
- PiKV: KV Cache Management System for Mixture of Experts [Efficient ML System] ☆48 · Updated 3 months ago
- DeeperGEMM: crazy optimized version ☆73 · Updated 9 months ago
- Prototype MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism ☆26 · Updated 10 months ago
- Preview Code for Continuum Paper ☆32 · Updated last week
- 🔥 LLM-powered GPU kernel synthesis: Train models to convert PyTorch ops into optimized Triton kernels via SFT+RL. Multi-turn compilation… ☆116 · Updated 2 months ago
- Compression for Foundation Models ☆35 · Updated 6 months ago
- Fast and memory-efficient exact attention ☆18 · Updated 2 weeks ago
- Fast and memory-efficient exact attention ☆75 · Updated 11 months ago
- [NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank ☆69 · Updated last year
- [NeurIPS 2025] Scaling Speculative Decoding with Lookahead Reasoning ☆63 · Updated 3 months ago
- [ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration ☆260 · Updated last year
- AI-Driven Research Systems (ADRS) ☆117 · Updated last month
- High-performance distributed data shuffling (all-to-all) library for MoE training and inference ☆111 · Updated last month
- A curated list for Efficient Large Language Models ☆11 · Updated last year
- Debug print operator for cudagraph debugging ☆14 · Updated last year