vllm-project / tpu-inference
TPU inference for vLLM, with unified JAX and PyTorch support.
☆97 · Updated this week
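Because tpu-inference plugs into vLLM rather than replacing it, inference code is written against vLLM's usual Python API. A minimal offline-generation sketch, assuming the plugin and a TPU runtime are installed; the model name is illustrative, and hardware selection is handled by the installed backend rather than by this code:

```python
# Minimal sketch of vLLM's offline-inference API. With the tpu-inference
# plugin installed, the same code runs against a TPU backend.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative model name
params = SamplingParams(temperature=0.7, max_tokens=64)

for out in llm.generate(["Explain TPUs in one sentence."], params):
    print(out.outputs[0].text)
```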
Alternatives and similar repositories for tpu-inference
Users interested in tpu-inference are comparing it to the libraries listed below.
- extensible collectives library in triton ☆89 · Updated 6 months ago
- PyTorch bindings for CUTLASS grouped GEMM (see the grouped-GEMM sketch after this list). ☆124 · Updated 4 months ago
- Applied AI experiments and examples for PyTorch ☆299 · Updated 2 months ago
- Fast low-bit matmul kernels in Triton ☆381 · Updated 3 weeks ago
- ring-attention experiments (see the online-softmax sketch after this list) ☆154 · Updated last year
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components. ☆215 · Updated this week
- ☆92 · Updated 11 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆264 · Updated this week
- Triton-based implementation of Sparse Mixture of Experts (see the top-k routing sketch after this list). ☆246 · Updated 3 weeks ago
- ☆240 · Updated this week
- PyTorch/XLA integration with JetStream (https://github.com/google/JetStream) for LLM inference ☆76 · Updated last month
- Best practices for training DeepSeek, Mixtral, Qwen and other MoE models using Megatron Core. ☆111 · Updated last week
- Collection of kernels written in the Triton language (see the minimal Triton kernel after this list) ☆157 · Updated 6 months ago
- How to ensure correctness and ship LLM-generated kernels in PyTorch ☆66 · Updated last week
- Boosting 4-bit inference kernels with 2:4 Sparsity (see the 2:4 pruning sketch after this list) ☆83 · Updated last year
- Cataloging released Triton kernels. ☆263 · Updated last month
- Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training ☆216 · Updated last year
- QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning ☆119 · Updated 3 weeks ago
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ☆270 · Updated 2 months ago
- Triton-based Symmetric Memory operators and examples ☆38 · Updated this week
- Genai-bench is a powerful benchmark tool designed for comprehensive token-level performance evaluation of large language model (LLM) serv… ☆220 · Updated last week
- ☆112 · Updated last year
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity ☆221 · Updated 2 years ago
- ☆141 · Updated 9 months ago
- GitHub mirror of the triton-lang/triton repo. ☆86 · Updated this week
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding (see the draft-and-verify sketch after this list) ☆130 · Updated 10 months ago
- A Quirky Assortment of CuTe Kernels ☆627 · Updated last week
- ArcticInference: vLLM plugin for high-throughput, low-latency inference ☆283 · Updated this week
- ☆65 · Updated 5 months ago
- DeeperGEMM: crazy optimized version ☆72 · Updated 5 months ago
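The CUTLASS grouped-GEMM bindings above fuse many independent, differently shaped matmuls into a single kernel launch. A plain-PyTorch grouped-GEMM sketch of the semantics such a kernel implements, with illustrative shapes:

```python
import torch

def grouped_gemm_reference(xs, ws):
    # Reference semantics for a grouped GEMM: one matmul per (x, w) pair,
    # where every pair may have different shapes. A fused grouped-GEMM
    # kernel computes all of these in a single launch.
    return [x @ w for x, w in zip(xs, ws)]

# Three groups with different row counts, as in MoE expert layers.
xs = [torch.randn(m, 64) for m in (5, 17, 3)]
ws = [torch.randn(64, 128) for _ in xs]
outs = grouped_gemm_reference(xs, ws)
print([tuple(o.shape) for o in outs])  # [(5, 128), (17, 128), (3, 128)]
```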
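The ring-attention experiments rest on one identity: exact softmax attention can be accumulated one K/V chunk at a time with a running max and denominator, which is what lets K/V shards rotate around a ring of devices while queries stay put. A single-process sketch of that online-softmax combine (shapes illustrative, no distributed machinery):

```python
import torch

def attention_over_kv_chunks(q, kv_chunks):
    # Accumulate exact attention one KV chunk at a time using a running
    # (max, denominator, numerator) state: the same combine that ring
    # attention applies as K/V shards arrive from neighboring devices.
    n, d = q.shape
    m = torch.full((n, 1), float("-inf"))  # running row max
    l = torch.zeros(n, 1)                  # running softmax denominator
    acc = torch.zeros(n, d)                # running numerator (probs @ V)
    for k, v in kv_chunks:                 # each: (chunk, d) keys/values
        s = q @ k.T / d ** 0.5
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        scale = torch.exp(m - m_new)       # rescale old state to the new max
        p = torch.exp(s - m_new)
        l = l * scale + p.sum(dim=-1, keepdim=True)
        acc = acc * scale + p @ v
        m = m_new
    return acc / l

q, k, v = torch.randn(4, 8), torch.randn(12, 8), torch.randn(12, 8)
chunks = [(k[i:i + 4], v[i:i + 4]) for i in range(0, 12, 4)]
ref = torch.softmax(q @ k.T / 8 ** 0.5, dim=-1) @ v
assert torch.allclose(attention_over_kv_chunks(q, chunks), ref, atol=1e-5)
```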
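The sparse-MoE implementation above fuses routing and expert matmuls into Triton kernels; the routing logic those kernels implement is small. A plain-PyTorch top-k routing sketch (names and shapes illustrative):

```python
import torch
import torch.nn.functional as F

def topk_route(x, router_w, k=2):
    # Score every token against every expert, keep the k best experts per
    # token, and renormalize their weights; experts then run only on the
    # tokens routed to them (the part a fused kernel accelerates).
    logits = x @ router_w                      # (tokens, experts)
    weights, experts = logits.topk(k, dim=-1)  # per-token expert choices
    return F.softmax(weights, dim=-1), experts

x = torch.randn(6, 16)           # 6 tokens, hidden size 16
router_w = torch.randn(16, 4)    # 4 experts
weights, experts = topk_route(x, router_w)
print(experts)  # which 2 of the 4 experts each token is dispatched to
```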
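For readers new to the Triton entries in this list, the canonical minimal Triton kernel is a masked vector add: a 1D launch grid, one tile per program instance, and boundary masks on loads and stores. This sketch runs as-is on a CUDA device with triton installed:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)                 # one program per tile
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements                 # guard the ragged last tile
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.randn(10_000, device="cuda")
y = torch.randn(10_000, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)
assert torch.allclose(out, x + y)
```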
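The 2:4 entry targets the semi-structured sparsity pattern that sparse tensor cores accelerate: at most 2 nonzeros in every group of 4 consecutive weights. A magnitude-based 2:4 pruning sketch that produces the pattern (the helper name is ours, not the repo's API):

```python
import torch

def prune_2_to_4(w):
    # In every contiguous group of 4 weights along the last dim, keep the
    # 2 largest by magnitude and zero the rest: the 2:4 pattern required
    # by semi-structured sparse kernels.
    assert w.shape[-1] % 4 == 0
    groups = w.reshape(-1, 4)
    keep = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, keep, True)
    return (groups * mask).reshape(w.shape)

w = torch.randn(2, 8)
print(prune_2_to_4(w))  # exactly 2 nonzeros per group of 4
```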
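The speculative-decoding entry builds on the standard draft-and-verify loop: a cheap draft model proposes a block of tokens, and the target model scores the whole block in one forward pass, keeping the longest agreeing prefix. A greedy-decoding draft-and-verify sketch with toy stand-in models (this is the generic technique, not the paper's specific long-sequence method):

```python
import torch

def speculative_step(target, draft, prefix, k=4):
    # One draft-and-verify round. `target` and `draft` map a token list to
    # per-position next-token logits. Greedy verification keeps the longest
    # prefix of draft tokens the target agrees with, plus one target token.
    seq, proposed = list(prefix), []
    for _ in range(k):                        # k cheap autoregressive steps
        proposed.append(int(draft(seq + proposed)[-1].argmax()))
    logits = target(seq + proposed)           # one expensive pass for the block
    n, accepted = len(seq), []
    for i, tok in enumerate(proposed):
        choice = int(logits[n + i - 1].argmax())
        accepted.append(choice)
        if choice != tok:                     # first disagreement: stop here
            break
    else:
        accepted.append(int(logits[-1].argmax()))  # bonus token, all accepted
    return seq + accepted

# Toy "models": embedding lookup times a projection; draft slightly perturbed.
torch.manual_seed(0)
emb, w_t = torch.randn(50, 16), torch.randn(16, 50)
w_d = w_t + 0.1 * torch.randn(16, 50)
target = lambda toks: emb[torch.tensor(toks)] @ w_t
draft = lambda toks: emb[torch.tensor(toks)] @ w_d
print(speculative_step(target, draft, [1, 2, 3]))
```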