FlagOpen / FlagGems
FlagGems is an operator library for large language models, implemented in the Triton language.
☆342 · Updated this week
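To give a sense of what "operators implemented in Triton" means in practice, below is a minimal Triton elementwise-add kernel in the style such a library provides. This is a generic sketch, not FlagGems source: the names `add_kernel` and `add` and the block size of 1024 are illustrative assumptions.

```python
import torch
import triton
import triton.language as tl

# A minimal Triton elementwise-add operator (generic sketch, not FlagGems code).
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                      # which chunk this program handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                      # guard the ragged final chunk
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)                   # one program per 1024-element chunk
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

A library of this kind typically registers such kernels so they can stand in for the corresponding PyTorch operators transparently.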
Related projects
Alternatives and complementary repositories for FlagGems
- A collection of memory-efficient attention operators implemented in the Triton language. ☆219 · Updated 5 months ago
- An easy-to-understand TensorOp matmul tutorial. ☆290 · Updated 2 months ago
- A FlashAttention tutorial written in Python, Triton, CUDA, and CUTLASS. ☆202 · Updated 5 months ago
- ☆140 · Updated 6 months ago
- A fast communication-overlapping library for tensor parallelism on GPUs. ☆224 · Updated 3 weeks ago
- Dynamic memory management for serving LLMs without PagedAttention. ☆238 · Updated last week
- Several optimization methods for half-precision general matrix multiplication (HGEMM) using tensor cores with the WMMA API and MMA PTX instruct… ☆302 · Updated 2 months ago
- Disaggregated serving system for large language models (LLMs). ☆359 · Updated 3 months ago
- ☆138 · Updated 2 weeks ago
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving. ☆443 · Updated last week
- Development repository for the Triton-Linalg conversion. ☆151 · Updated last month
- ☆167 · Updated 4 months ago
- ☆79 · Updated 2 months ago
- A low-latency & high-throughput serving engine for LLMs. ☆245 · Updated 2 months ago
- ☆57 · Updated 2 weeks ago
- ☆123 · Updated 2 weeks ago
- Analyzes the inference of large language models (LLMs), covering computation, storage, transmission, and the hardware roofline mod… (a minimal roofline sketch follows this list). ☆311 · Updated 2 months ago
- Performance of the C++ interface of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios. ☆29 · Updated 2 months ago
- [EMNLP 2024 Industry Track] The official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a V… ☆322 · Updated this week
- BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment. ☆420 · Updated this week
- AI Accelerator Benchmark focuses on evaluating AI accelerators from a practical production perspective, including the ease of use and ver… ☆203 · Updated last week
- Shared Middle-Layer for Triton Compilation. ☆191 · Updated this week
- ☆79 · Updated 8 months ago
- Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052 ☆457 · Updated 8 months ago
- Yinghan's Code Sample. ☆289 · Updated 2 years ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5). ☆208 · Updated 3 weeks ago
- Learning how CUDA works. ☆169 · Updated 3 months ago
- A baseline repository for auto-parallelism in training neural networks. ☆142 · Updated 2 years ago
- A simple, high-performance CUDA GEMM implementation. ☆335 · Updated 10 months ago
- A model compilation solution for various hardware. ☆378 · Updated last week
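To make the roofline analysis mentioned above concrete, here is a minimal sketch in plain Python. The hardware numbers (A100-class FP16 peak and HBM bandwidth) and the GEMM shape are illustrative assumptions, not values taken from any of the listed repositories.

```python
# Minimal roofline sketch: attainable throughput is capped either by peak
# compute or by memory traffic (arithmetic intensity x bandwidth).
# Hardware numbers are illustrative A100-class assumptions.
PEAK_FLOPS = 312e12   # FP16 tensor-core peak, FLOP/s (assumed)
PEAK_BW = 2.0e12      # HBM bandwidth, bytes/s (assumed)

def attainable_flops(flops: float, bytes_moved: float) -> float:
    """Attainable FLOP/s for a kernel with the given compute and traffic."""
    intensity = flops / bytes_moved          # FLOP per byte
    return min(PEAK_FLOPS, intensity * PEAK_BW)

# Example: an FP16 GEMM C[M,N] = A[M,K] @ B[K,N] with M = N = K = 4096.
M = N = K = 4096
flops = 2 * M * N * K                        # each multiply-add counts as 2 FLOPs
bytes_moved = 2 * (M * K + K * N + M * N)    # read A and B, write C; fp16 = 2 bytes
print(f"intensity: {flops / bytes_moved:.0f} FLOP/byte")
print(f"attainable: {attainable_flops(flops, bytes_moved) / 1e12:.0f} TFLOP/s")
```

At this shape the arithmetic intensity (about 1365 FLOP/byte) sits far above the machine balance point (312e12 / 2.0e12 = 156 FLOP/byte), so the GEMM is compute-bound; small-batch decode kernels typically fall below the balance point and are bandwidth-bound instead.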