microsoft / glinthawkLinks

An LLM inference engine, written in C++

☆15

Alternatives and similar repositories for glinthawk

Users that are interested in glinthawk are comparing it to the libraries listed below

Sorting:

microsoft / FractalTensor
FractalTensor is a programming framework that introduces a novel approach to organizing data in deep neural networks (DNNs) as a list of …
☆26Updated 6 months ago
AutonomicPerfectionist / PipeInfer
PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation
☆30Updated 7 months ago
microsoft / tokenweave
Efficient Compute-Communication Overlap for Distributed LLM Inference
☆13Updated this week
tyler-griggs / melange-release
☆47Updated last year
Ying1123 / VTC-artifact
☆32Updated last year
eth-easl / deltazip
Compression for Foundation Models
☆32Updated 3 months ago
metacarbon / shareAtt
Beyond KV Caching: Shared Attention for Efficient LLMs
☆19Updated 11 months ago
eth-easl / mixtera
A lightweight, user-friendly data-plane for LLM training.
☆19Updated 2 months ago
SymbioticLab / Oobleck
A resilient distributed training framework
☆95Updated last year
MDK8888 / vllmini
A minimal implementation of vllm.
☆44Updated 11 months ago
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆116Updated 6 months ago
hao-ai-lab / MuxServe
☆62Updated last year
gau-nernst / quantized-training
Explore training for quantized models
☆18Updated last week
yale-sys / prompt-cache
Modular and structured prompt caching for low-latency LLM inference
☆96Updated 7 months ago
mlc-ai / mlc-python
☆35Updated last month
flexflow / flexflow-serve
FlexFlow Serve: Low-Latency, High-Performance LLM Serving
☆42Updated last month
microsoft / ParrotServe
[OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable
☆163Updated 9 months ago
NEO-MLSys25 / NEO
NEO is a LLM inference engine built to save the GPU memory crisis by CPU offloading
☆39Updated last week
Infrawaves / DeepEP_ibrc_dual-ports_multiQP
Aims to implement dual-port and multi-qp solutions in deepEP ibrc transport
☆50Updated last month
DerrickYLJ / TidalDecode
[ICLR 2025] TidalDecode: A Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
☆39Updated 2 months ago
project-etalon / etalon
LLM Serving Performance Evaluation Harness
☆78Updated 4 months ago
UChi-JCL / CacheGen
☆109Updated 8 months ago
tile-ai / tvm
Open deep learning compiler stack for cpu, gpu and specialized accelerators
☆19Updated this week
alexzhang13 / Triton-Puzzles-Solutions
Personal solutions to the Triton Puzzles
☆19Updated 11 months ago
NolanoOrg / SpectraSuite
☆48Updated 11 months ago
amazon-science / piperag
PipeRAG: Fast Retrieval-Augmented Generation via Algorithm-System Co-design (KDD 2025)
☆21Updated last year
cassiewilliam / cuda_op_benchmark
方便扩展的Cuda算子理解和优化框架，仅用在学习使用
☆15Updated last year
tspeterkim / cuda-1brc
My CUDA solution to the 1BRC
☆10Updated last year
mlsys-io / kv.run
A model serving framework for various research and production scenarios. Seamlessly built upon the PyTorch and HuggingFace ecosystem.
☆23Updated 8 months ago
thomaschlt / mla.c
Implementation from scratch in C of the Multi-head latent attention used in the Deepseek-v3 technical paper.
☆18Updated 5 months ago