AlibabaResearch/flash-llm

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/AlibabaResearch/flash-llm)

AlibabaResearch / flash-llm

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

☆246

Alternatives and similar repositories for flash-llm

Users that are interested in flash-llm are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

microsoft / SparTA
View on GitHub
☆167Jul 22, 2024Updated 2 years ago
xxyux / SpInfer
View on GitHub
SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
☆68Mar 25, 2025Updated last year
HPMLL / SpInfer_EuroSys25
View on GitHub
☆35Apr 2, 2025Updated last year
LeiWang1999 / AutoGPTQ.tvm
View on GitHub
GPTQ inference TVM kernel
☆41Apr 25, 2024Updated 2 years ago
KnowingNothing / MatmulTutorial
View on GitHub
A Easy-to-understand TensorOp Matmul Tutorial
☆445Mar 5, 2026Updated 4 months ago
Deploy to Railway using AI coding agents - Free Credits Offer • Ad
Use Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
pku-liang / MAGIS
View on GitHub
MAGIS: Memory Optimization via Coordinated Graph Transformation and Scheduling for DNN (ASPLOS'24)
☆57May 29, 2024Updated 2 years ago
PanZaifeng / RecFlex
View on GitHub
A recommendation model kernel optimizing system
☆12Jun 5, 2025Updated last year
google-research / sputnik
View on GitHub
A library of GPU kernels for sparse matrix operations.
☆289Nov 24, 2020Updated 5 years ago
LeiWang1999 / Stream-k.tvm
View on GitHub
☆20Sep 28, 2024Updated last year
FMInference / DejaVu
View on GitHub
☆359Apr 2, 2024Updated 2 years ago
AlibabaResearch / recom
View on GitHub
An Optimizing Compiler for Recommendation Model Inference
☆26Jun 5, 2025Updated last year
aoli-al / HFuse
View on GitHub
Horizontal Fusion
☆24Jan 7, 2022Updated 4 years ago
FMInference / H2O
View on GitHub
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.
☆530Aug 1, 2024Updated last year
ColfaxResearch / cutlass-kernels
View on GitHub
☆269Jul 11, 2024Updated 2 years ago
Managed hosting for WordPress and PHP on Cloudways • Ad
Managed hosting for WordPress, Magento, Laravel, or PHP apps, on multiple cloud providers. Deploy in minutes on Cloudways by DigitalOcean.
efeslab / Nanoflow
View on GitHub
A throughput-oriented high-performance serving framework for LLMs
☆968Mar 29, 2026Updated 3 months ago
IST-DASLab / marlin
View on GitHub
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
☆1,109Sep 4, 2024Updated last year
uwsampl / SparseTIR
View on GitHub
SparseTIR: Sparse Tensor Compiler for Deep Learning
☆145Mar 31, 2023Updated 3 years ago
mit-han-lab / omniserve
View on GitHub
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆850Mar 6, 2025Updated last year
microsoft / sarathi-serve
View on GitHub
A low-latency & high-throughput serving engine for LLMs
☆512Jan 8, 2026Updated 6 months ago
LeiWang1999 / tvm_gpu_gemm
View on GitHub
play gemm with tvm
☆91Jul 22, 2023Updated 3 years ago
jiazhihao / attention_superoptimizer
View on GitHub
An Attention Superoptimizer
☆22Jan 20, 2025Updated last year
bytedance / flux
View on GitHub
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
☆1,344Aug 28, 2025Updated 10 months ago
YukeWang96 / TC-GNN_ATC23
View on GitHub
Artifact for USENIX ATC'23: TC-GNN: Bridging Sparse GNN Computation and Dense Tensor Cores on GPUs.
☆58Oct 16, 2023Updated 2 years ago
Deploy to Railway using AI coding agents - Free Credits Offer • Ad
Use Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
efeslab / Atom
View on GitHub
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆344Jul 2, 2024Updated 2 years ago
stanford-futuredata / stk
View on GitHub
☆113Aug 26, 2024Updated last year
Bruce-Lee-LY / flash_attention_inference
View on GitHub
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆45Feb 27, 2025Updated last year
microsoft / vattention
View on GitHub
Dynamic Memory Management for Serving LLMs without PagedAttention
☆504Updated this week
microsoft / mscclpp
View on GitHub
MSCCL++: A GPU-driven communication stack for scalable AI applications
☆542Updated this week
llumnix-project / llumnix-ray
View on GitHub
Efficient and easy multi-instance LLM serving
☆563Mar 12, 2026Updated 4 months ago
microsoft / BitBLAS
View on GitHub
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
☆769Aug 6, 2025Updated 11 months ago
AlibabaResearch / mononn
View on GitHub
☆32Jul 17, 2024Updated 2 years ago
LLMServe / DistServe
View on GitHub
Disaggregated serving system for Large Language Models (LLMs).
☆826Apr 6, 2025Updated last year
Managed hosting for WordPress and PHP on Cloudways • Ad
Managed hosting for WordPress, Magento, Laravel, or PHP apps, on multiple cloud providers. Deploy in minutes on Cloudways by DigitalOcean.
Bruce-Lee-LY / decoding_attention
View on GitHub
Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference.
☆47Jun 11, 2025Updated last year
apuaaChen / vectorSparse
View on GitHub
☆32Aug 24, 2022Updated 3 years ago
UDC-GAC / venom
View on GitHub
A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores
☆62Nov 24, 2023Updated 2 years ago
Yinghan-Li / YHs_Sample
View on GitHub
Yinghan's Code Sample
☆365Jul 25, 2022Updated 3 years ago
microsoft / nnfusion
View on GitHub
A flexible and efficient deep neural network (DNN) compiler that generates high-performance executable from a DNN model description.
☆1,002Sep 19, 2024Updated last year
AmadeusChan / Awesome-LLM-System-Papers
View on GitHub
☆646Jan 14, 2026Updated 6 months ago
efeslab / fiddler
View on GitHub
[ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration
☆267Nov 18, 2024Updated last year