microsoft / glinthawk
An LLM inference engine, written in C++
☆12 Updated 2 months ago
Alternatives and similar repositories for glinthawk:
Users who are interested in glinthawk are comparing it to the libraries listed below.
- FractalTensor is a programming framework that introduces a novel approach to organizing data in deep neural networks (DNNs) as a list of … ☆24 Updated 3 months ago
- Open deep learning compiler stack for cpu, gpu and specialized accelerators ☆18 Updated last week
- A minimal implementation of vLLM. ☆37 Updated 8 months ago
- A curated list for Efficient Large Language Models ☆11 Updated last year
- TensorRT LLM Benchmark Configuration ☆13 Updated 8 months ago
- Official Implementation of "CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks" ☆17 Updated 3 months ago
- A memory profiler for NVIDIA GPUs to explore memory inefficiencies in GPU-accelerated applications. ☆25 Updated 5 months ago
- A resilient distributed training framework ☆91 Updated 11 months ago
- [ICLR 2025] TidalDecode: A Fast and Accurate LLM Decoding with Position Persistent Sparse Attention ☆31 Updated 3 weeks ago
- How much energy do GenAI models consume? ☆42 Updated 5 months ago
- ☆12 Updated 10 months ago
- Stateful LLM Serving ☆50 Updated 2 weeks ago
- An Attention Superoptimizer ☆21 Updated 2 months ago
- PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation ☆28 Updated 4 months ago
- ☆46 Updated 8 months ago
- Official Repo for "LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization" ☆29 Updated last year
- Artifact for "Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving" [SOSP '24] ☆22 Updated 4 months ago
- ☆11 Updated 7 months ago
- ☆45 Updated 9 months ago
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆111 Updated 3 months ago
- From-scratch C implementation of the multi-head latent attention used in the DeepSeek-V3 technical paper. ☆15 Updated 2 months ago
- ☆26 Updated last week
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving ☆32 Updated this week
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆74 Updated this week
- ☆20 Updated this week
- Exploration of Vector database Index for fast approximate nearest neighbour search. ☆21 Updated 7 months ago
- ❓Curie: Automated and Rigorous Scientific Experimentation with AI Agents ☆54 Updated this week
- Compression for Foundation Models ☆27 Updated this week
- ☆11 Updated 3 years ago
- Personal solutions to the Triton Puzzles ☆18 Updated 8 months ago