InfiniTensor / ninetoothed
A domain-specific language (DSL) based on Triton but providing higher-level abstractions.
☆18 Updated last week
Alternatives and similar repositories for ninetoothed:
Users who are interested in ninetoothed are comparing it to the libraries listed below:
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. By pro… ☆67 Updated this week
- Canvas: End-to-End Kernel Architecture Search in Neural Networks ☆26 Updated 3 months ago
- ☆100 Updated last week
- Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS ☆19 Updated last month
- TiledKernel is a code generation library based on macro kernels and memory hierarchy graph data structure. ☆19 Updated 10 months ago
- ☆24 Updated 2 months ago
- A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of … ☆144 Updated 8 months ago
- ☆52 Updated 9 months ago
- GPTQ inference TVM kernel ☆39 Updated 10 months ago
- We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel … ☆176 Updated last month
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving ☆25 Updated this week
- Stateful LLM Serving ☆46 Updated this week
- [OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable ☆149 Updated 5 months ago
- ☆87 Updated 4 months ago
- The driver for LMCache core to run in vLLM ☆32 Updated last month
- ☆226 Updated last month
- ☆19 Updated 5 months ago
- Fast and memory-efficient exact attention ☆49 Updated this week
- Implement Flash Attention using CuTe. ☆71 Updated 2 months ago
- Automated Parallelization System and Infrastructure for Multiple Ecosystems ☆78 Updated 3 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆67 Updated 6 months ago
- [ICLR 2025] TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention ☆29 Updated last week
- An IR for efficiently simulating distributed ML computation. ☆28 Updated last year
- High performance Transformer implementation in C++. ☆105 Updated last month
- ☆52 Updated 11 months ago
- ☆87 Updated 6 months ago
- Artifacts of EVT ASPLOS'24 ☆23 Updated last year
- SpotServe: Serving Generative Large Language Models on Preemptible Instances ☆112 Updated last year