mlc-ai / mlc-en
☆411 · Updated 6 months ago
Alternatives and similar repositories for mlc-en:
Users interested in mlc-en are comparing it to the libraries listed below.
- ☆205 · Updated 4 months ago
- An open-source efficient deep learning framework/compiler, written in Python. ☆698 · Updated last month
- ☆607 · Updated 10 months ago
- Distributed Triton for Parallel Systems ☆415 · Updated last week
- Applied AI experiments and examples for PyTorch ☆261 · Updated 3 weeks ago
- Dive into Deep Learning Compiler ☆646 · Updated 2 years ago
- Fast low-bit matmul kernels in Triton ☆288 · Updated this week
- Cataloging released Triton kernels. ☆217 · Updated 3 months ago
- GPTQ inference Triton kernel ☆299 · Updated last year
- A curated list of awesome projects and papers for distributed training or inference ☆231 · Updated 6 months ago
- Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052 ☆472 · Updated last year
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving ☆304 · Updated 9 months ago
- A collection of memory-efficient attention operators implemented in the Triton language. ☆262 · Updated 10 months ago
- A library to analyze PyTorch traces. ☆366 · Updated last week
- An easy-to-understand TensorOp Matmul tutorial ☆342 · Updated 6 months ago
- Step-by-step optimization of CUDA SGEMM ☆308 · Updated 3 years ago
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity ☆204 · Updated last year
- BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment. ☆587 · Updated 2 months ago
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ☆639 · Updated last month
- A schedule language for large model training ☆146 · Updated 10 months ago
- A performant, memory-efficient checkpointing library for PyTorch applications, designed with large, complex distributed workloads in mind… ☆156 · Updated 4 months ago
- Latency and Memory Analysis of Transformer Models for Training and Inference ☆403 · Updated last month
- ☆192 · Updated 2 years ago
- A Python-level JIT compiler designed to make unmodified PyTorch programs faster. ☆1,040 · Updated last year
- Materials for learning SGLang ☆379 · Updated 3 weeks ago
- A flexible and efficient deep neural network (DNN) compiler that generates high-performance executables from a DNN model description. ☆981 · Updated 7 months ago
- ☆199 · Updated last week
- KernelBench: Can LLMs Write GPU Kernels? - A benchmark with Torch -> CUDA problems ☆268 · Updated this week
- An experimental CPU backend for Triton ☆105 · Updated last week
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆803 · Updated 7 months ago