microsoft / glinthawk
An LLM inference engine, written in C++
☆18 · Updated 2 weeks ago
Alternatives and similar repositories for glinthawk
Users interested in glinthawk are comparing it to the libraries listed below
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆141 · Updated last year
- Easy, Fast, and Scalable Multimodal AI ☆106 · Updated last week
- LLM Serving Performance Evaluation Harness ☆83 · Updated 11 months ago
- ☆48 · Updated last year
- AI-Driven Research Systems (ADRS) ☆117 · Updated last month
- [ICLR 2025] TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention ☆53 · Updated 6 months ago
- KV cache compression for high-throughput LLM inference ☆151 · Updated last year
- Accelerate LLM preference tuning via prefix sharing with a single line of code ☆51 · Updated 7 months ago
- 🔥 LLM-powered GPU kernel synthesis: Train models to convert PyTorch ops into optimized Triton kernels via SFT+RL. Multi-turn compilation… ☆116 · Updated 2 months ago
- ☆36 · Updated 11 months ago
- Fast and memory-efficient exact attention ☆18 · Updated 2 weeks ago
- [NeurIPS 2025] Scaling Speculative Decoding with Lookahead Reasoning ☆63 · Updated 3 months ago
- [OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable ☆209 · Updated last year
- Preview Code for Continuum Paper ☆32 · Updated last week
- Efficient Long-context Language Model Training by Core Attention Disaggregation ☆85 · Updated last week
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆176 · Updated last year
- Accepted to MLSys 2026 ☆70 · Updated last week
- ☆158 · Updated 11 months ago
- ☆47 · Updated 9 months ago
- Scalable long-context LLM decoding that leverages sparsity by treating the KV cache as a vector storage system. ☆122 · Updated last month
- Vortex: A Flexible and Efficient Sparse Attention Framework ☆45 · Updated 2 weeks ago
- DeeperGEMM: crazy optimized version ☆73 · Updated 9 months ago
- PyTorch implementation of the paper "Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline". ☆93 · Updated 2 years ago
- ☆117 · Updated 8 months ago
- An early-research-stage expert-parallel load balancer for MoE models based on linear programming. ☆495 · Updated 2 months ago
- [Archived] For the latest updates and community contributions, please visit: https://github.com/Ascend/TransferQueue or https://gitcode.co… ☆13 · Updated 3 weeks ago
- Accelerate Video Diffusion Inference via Sketching-Rendering Cooperation ☆19 · Updated 7 months ago
- Beyond KV Caching: Shared Attention for Efficient LLMs ☆20 · Updated last year
- ☆132 · Updated 8 months ago
- ☆22 · Updated 10 months ago