FlagOpen / FlagGems

FlagGems is an operator library for large language models implemented in the Triton Language.

☆635

Alternatives and similar repositories for FlagGems

Users who are interested in FlagGems are comparing it to the libraries listed below.

ByteDance-Seed / Triton-distributed
Distributed Compiler based on Triton for Parallel Systems
☆941 Updated this week
SiriusNEO / Triton-Puzzles-Lite
Puzzles for learning Triton, play it with minimal environment configuration!
☆446 Updated 8 months ago
FlagOpen / FlagAttention
A collection of memory efficient attention operators implemented in the Triton language.
☆275 Updated last year
66RING / tiny-flash-attention
flash attention tutorial written in python, triton, cuda, cutlass
☆398 Updated 2 months ago
KnowingNothing / MatmulTutorial
An Easy-to-understand TensorOp Matmul Tutorial
☆369 Updated 10 months ago
Bruce-Lee-LY / cuda_hgemm
Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct…
☆450 Updated 10 months ago
ifromeast / cuda_learning
learning how CUDA works
☆295 Updated 5 months ago
sgl-project / sgl-learning-materials
Materials for learning SGLang
☆515 Updated 2 weeks ago
DD-DuDa / Cute-Learning
Examples of CUDA implementations by Cutlass CuTe
☆214 Updated last month
mit-han-lab / omniserve
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆730 Updated 5 months ago
Cambricon / triton-linalg
Development repository for the Triton-Linalg conversion
☆190 Updated 5 months ago
microsoft / vattention
Dynamic Memory Management for Serving LLMs without PagedAttention
☆405 Updated 2 months ago
LLMServe / DistServe
Disaggregated serving system for Large Language Models (LLMs).
☆654 Updated 3 months ago
bytedance / ByteTransformer
optimized BERT transformer inference on NVIDIA GPU. https://arxiv.org/abs/2210.03052
☆474 Updated last year
microsoft / BitBLAS
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
☆654 Updated 3 weeks ago
BBuf / how-to-learn-deep-learning-framework
how to learn PyTorch and OneFlow
☆445 Updated last year
DefTruth / CUDA-Learn-Notes
📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).
☆36 Updated 3 months ago
Cjkkkk / CUDA_gemm
A simple high performance CUDA GEMM implementation.
☆392 Updated last year
hahnyuan / LLM-Viewer
Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline mod…
☆521 Updated 10 months ago
antgroup / glake
GLake: optimizing GPU memory management and IO transmission.
☆471 Updated 4 months ago
ColfaxResearch / cutlass-kernels
☆227 Updated last year
OpenPPL / ppl.nn.llm
☆139 Updated last year
Yinghan-Li / YHs_Sample
Yinghan's Code Sample
☆340 Updated 3 years ago
FlagOpen / FlagScale
FlagScale is a large model toolkit based on open-sourced projects.
☆333 Updated this week
harleyszhang / llm_counts
LLM theoretical performance analysis tools, supporting params, FLOPs, memory, and latency analysis.
☆99 Updated 3 weeks ago
OpenPPL / ppl.llm.kernel.cuda
☆149 Updated 6 months ago
ModelTC / LightCompress
[EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a V…
☆528 Updated this week
sail-sg / zero-bubble-pipeline-parallelism
Zero Bubble Pipeline Parallelism
☆411 Updated 2 months ago
reed-lau / cute-gemm
☆128 Updated 8 months ago
tile-ai / tilelang
Domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernels
☆1,489 Updated this week