mirage-project / mirage
A multi-level tensor algebra superoptimizer
☆314 · Updated this week
Related projects:
- BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment. ☆342 · Updated this week
- This repository contains the experimental PyTorch native float8 training UX. ☆210 · Updated last month
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆452 · Updated last week
- Applied AI experiments and examples for PyTorch. ☆121 · Updated last month
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs. ☆156 · Updated this week
- Cataloging released Triton kernels. ☆108 · Updated 3 weeks ago
- Code for QuaRot, an end-to-end 4-bit inference scheme for large language models. ☆254 · Updated last month
- Flash Attention in ~100 lines of CUDA (forward pass only). ☆558 · Updated 5 months ago
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving. ☆399 · Updated 2 weeks ago
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving. ☆258 · Updated 2 months ago
- Collection of kernels written in the Triton language. ☆48 · Updated 2 weeks ago
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components. ☆144 · Updated this week
- FP16xINT4 LLM inference kernel achieving near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆560 · Updated 2 weeks ago
- Fast Inference of MoE Models with CPU-GPU Orchestration. ☆163 · Updated 3 months ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5). ☆173 · Updated 3 months ago
- A throughput-oriented high-performance serving framework for LLMs. ☆470 · Updated this week
- A scalable and robust tree-based speculative decoding algorithm. ☆298 · Updated last month
- Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in PyTorch. ☆451 · Updated last month
- Repository for the QUIK project, enabling the use of 4-bit kernels for generative inference. ☆167 · Updated 5 months ago
- ring-attention experiments. ☆89 · Updated 5 months ago
- High-speed GEMV kernels achieving up to 2.7x speedup over the PyTorch baseline. ☆81 · Updated 2 months ago
- Triton-based implementation of Sparse Mixture of Experts. ☆166 · Updated 3 weeks ago
- Transformers with Arbitrarily Large Context. ☆613 · Updated last month
- Code for the paper "QuIP: 2-Bit Quantization of Large Language Models With Guarantees". ☆339 · Updated 6 months ago
- KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization. ☆281 · Updated last month
- Ring attention implementation with flash attention. ☆529 · Updated this week
- Fast CUDA matrix multiplication from scratch. ☆420 · Updated 8 months ago
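Several of the kernel projects above (the from-scratch CUDA GEMM, the Triton kernel collections, the GEMV kernels) build on the same core idea: tiling the computation so each output block is accumulated from small reusable input blocks. A minimal pure-Python sketch of that tiling scheme follows; the function name and structure are illustrative only and are not taken from any of the listed repositories.

```python
# Hypothetical pure-Python sketch of cache-blocking (tiling) for matrix
# multiply -- the structural idea behind fast GEMM kernels, minus the
# GPU-specific parts (shared memory, warps, vectorized loads).

def blocked_matmul(a, b, tile=2):
    """Multiply a (m x k) by b (k x n), given as lists of lists,
    accumulating one output tile at a time."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for p0 in range(0, k, tile):
                # Accumulate the (i0, j0) output tile from the (i0, p0)
                # tile of a and the (p0, j0) tile of b. On a GPU these
                # tiles would be staged in shared memory for reuse.
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        s = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            s += a[i][p] * b[p][j]
                        c[i][j] = s
    return c

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
print(blocked_matmul(a, b))  # → [[19.0, 22.0], [43.0, 50.0]]
```

The payoff of this loop structure is data reuse: each `tile x tile` block of the inputs is touched once per output tile rather than once per output element, which is what makes the CUDA and Triton versions above fast.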