flagos-ai/FlagGems

FlagGems is an operator library for large language models implemented in the Triton Language.

☆909

Alternatives and similar repositories for FlagGems

Users who are interested in FlagGems are comparing it to the libraries listed below.

flagos-ai / FlagAttention
A collection of memory efficient attention operators implemented in the Triton language.
☆288 · Jun 5, 2024 · Updated last year
Cambricon / triton-linalg
Development repository for the Triton-Linalg conversion
☆215 · Feb 7, 2025 · Updated last year
microsoft / triton-shared
Shared Middle-Layer for Triton Compilation
☆329 · Dec 5, 2025 · Updated 3 months ago
ByteDance-Seed / Triton-distributed
Distributed Compiler based on Triton for Parallel Systems
☆1,371 · Feb 13, 2026 · Updated 3 weeks ago
meta-pytorch / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆327 · Updated this week
flashinfer-ai / flashinfer
FlashInfer: Kernel Library for LLM Serving
☆5,057 · Updated this week
tile-ai / tilelang
Domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernels
☆5,284 · Updated this week
bytedance / flux
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
☆1,264 · Aug 28, 2025 · Updated 6 months ago
flagos-ai / FlagTree
FlagTree is a unified compiler supporting multiple AI chip backends for custom Deep Learning operations, which is forked from triton-lang…
☆214 · Updated this week
microsoft / BitBLAS
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
☆753 · Aug 6, 2025 · Updated 6 months ago
mirage-project / mirage
Mirage Persistent Kernel: Compiling LLMs into a MegaKernel
☆2,145 · Feb 23, 2026 · Updated last week
KnowingNothing / MatmulTutorial
An Easy-to-understand TensorOp Matmul Tutorial
☆409 · Updated this week
gpu-mode / triton-index
Cataloging released Triton kernels.
☆295 · Sep 9, 2025 · Updated 5 months ago
AlibabaPAI / FLASHNN
☆104 · Sep 9, 2024 · Updated last year
BBuf / how-to-optim-algorithm-in-cuda
How to optimize some algorithms in CUDA.
☆2,841 · Updated this week
flagos-ai / FlagPerf
FlagPerf is an open-source software platform for benchmarking AI chips.
☆362 · Nov 11, 2025 · Updated 3 months ago
HazyResearch / ThunderKittens
Tile primitives for speedy kernels
☆3,202 · Feb 24, 2026 · Updated last week
ModelTC / LightLLM
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalabili…
☆3,919 · Updated this week
KEKE046 / mlir-tutorial
Hands-On Practical MLIR Tutorial
☆724 · Oct 20, 2023 · Updated 2 years ago
weishengying / cutlass_flash_atten_fp8
Implements fp8 flash attention on the Ada architecture using the cutlass repository.
☆79 · Aug 12, 2024 · Updated last year
IBM / triton-dejavu
Framework to reduce autotune overhead to zero for well known deployments.
☆97 · Sep 19, 2025 · Updated 5 months ago
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆194 · Jan 28, 2025 · Updated last year
BBuf / tvm_mlir_learn
A collection of compiler learning resources.
☆2,684 · Mar 19, 2025 · Updated 11 months ago
ColfaxResearch / cutlass-kernels
☆262 · Jul 11, 2024 · Updated last year
IST-DASLab / marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
☆1,025 · Sep 4, 2024 · Updated last year
66RING / tiny-flash-attention
Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS
☆490 · Jan 20, 2026 · Updated last month
xlite-dev / ffpa-attn
🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA.
☆251 · Feb 13, 2026 · Updated 3 weeks ago
triton-lang / kernels
☆105 · Nov 7, 2024 · Updated last year
dropbox / gemlite
Fast low-bit matmul kernels in Triton
☆436 · Feb 1, 2026 · Updated last month
DD-DuDa / Cute-Learning
Examples of CUDA implementations by Cutlass CuTe
☆269 · Jul 1, 2025 · Updated 8 months ago
iclementine / optimize_softmax
Optimize softmax in Triton in many cases
☆23 · Sep 6, 2024 · Updated last year
flagos-ai / FlagScale
FlagScale is a large model toolkit based on open-sourced projects.
☆485 · Updated this week
alibaba / BladeDISC
BladeDISC is an end-to-end DynamIc Shape Compiler project for machine learning workloads.
☆917 · Dec 30, 2024 · Updated last year
perplexityai / pplx-kernels
Perplexity GPU Kernels
☆567 · Nov 7, 2025 · Updated 3 months ago
cchan / tccl
Extensible collectives library in Triton
☆96 · Mar 31, 2025 · Updated 11 months ago
triton-lang / triton
Development repository for the Triton language and compiler
☆18,501 · Updated this week
thuml / depyf
depyf is a tool to help you understand and adapt to PyTorch compiler torch.compile.
☆790 · Oct 13, 2025 · Updated 4 months ago
Deep-Learning-Profiling-Tools / triton-viz
☆301 · Updated this week
srush / Triton-Puzzles
Puzzles for learning Triton
☆2,324 · Nov 18, 2024 · Updated last year