mlc-ai / relax

☆167

Alternatives and similar repositories for relax

Users that are interested in relax are comparing it to the libraries listed below

tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆112 · Updated last year
usyd-fsalab / fp6_llm
Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).
☆266 · Updated 3 months ago
wangsiping97 / FastGEMV
High-speed GEMV kernels, with up to 2.7x speedup compared to the PyTorch baseline.
☆117 · Updated last year
ankan-ban / llama_cu_awq
LLaMA INT4 CUDA inference with AWQ
☆55 · Updated 9 months ago
tlc-pack / cutlass_fpA_intB_gemm
A standalone GEMM kernel for FP16 activations and quantized weights, extracted from FasterTransformer
☆94 · Updated last month
HandH1998 / QQQ
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
☆144 · Updated 2 months ago
mlc-ai / llm-perf-bench
☆120 · Updated last year
ROCm / aotriton
Ahead of Time (AOT) Triton Math Library
☆80 · Updated last week
OpenPPL / ppl.llm.kernel.cuda
☆150 · Updated 9 months ago
OpenPPL / ppl.llm.serving
☆129 · Updated 10 months ago
mlc-ai / notebooks
☆210 · Updated 11 months ago
xlite-dev / ffpa-attn
🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA.
☆223 · Updated 2 months ago
Bruce-Lee-LY / flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆41 · Updated 7 months ago
OpenPPL / ppl.nn.llm
☆139 · Updated last year
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆186 · Updated 8 months ago
mlc-ai / mlc-python
☆38 · Updated 3 months ago
NVIDIA / online-softmax
Benchmark code for the "Online normalizer calculation for softmax" paper
☆102 · Updated 7 years ago
InternLM / turbomind
☆97 · Updated 7 months ago
FlagTree / flagtree
FlagTree is a unified compiler for multiple AI chips, forked from triton-lang/triton.
☆93 · Updated this week
triton-lang / triton-cpu
An experimental CPU backend for Triton
☆154 · Updated this week
vllm-project / flash-attention
Fast and memory-efficient exact attention
☆96 · Updated this week
microsoft / BitBLAS
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
☆698 · Updated 2 months ago
LeiWang1999 / tvm_gpu_gemm
Playing with GEMM in TVM
☆92 · Updated 2 years ago
fpgaminer / GPTQ-triton
GPTQ inference Triton kernel
☆311 · Updated 2 years ago
bytedance / byteir
A model compilation solution for various hardware
☆451 · Updated 2 months ago
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, and achieve peak⚡️ performance.
☆121 · Updated 5 months ago
BBuf / tensorrt-llm-moe
☆33 · Updated 8 months ago
AlibabaPAI / FLASHNN
☆100 · Updated last year
intel / intel-xpu-backend-for-triton
OpenAI Triton backend for Intel® GPUs
☆212 · Updated this week
bytedance / ByteMLPerf
AI Accelerator Benchmark focuses on evaluating AI Accelerators from a practical production perspective, including the ease of use and ver…
☆265 · Updated 2 months ago