tile-ai / TileAttentionLinks

☆39

Alternatives and similar repositories for TileAttention

Users that are interested in TileAttention are comparing it to the libraries listed below

Sorting:

li-plus / flash-preference
Accelerate LLM preference tuning via prefix sharing with a single line of code
☆41Updated last month
tile-ai / AttentionEngine
☆49Updated last month
feifeibear / ChituAttention
Quantized Attention on GPU
☆44Updated 7 months ago
feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆70Updated last year
microsoft / AttentionEngine
☆71Updated last month
TiledTensor / TiledBench
Benchmark tests supporting the TiledCUDA library.
☆16Updated 7 months ago
Aleph-Alpha-Research / NeurIPS-WANT-submission-efficient-parallelization-layouts
☆22Updated last year
BBuf / flash-rwkv
☆31Updated last year
yanring / Megatron-MoE-ModelZoo
Best practices for testing advanced Mixtral, DeepSeek, and Qwen series MoE models using Megatron Core MoE.
☆21Updated 2 weeks ago
sail-sg / VocabularyParallelism
Vocabulary Parallelism
☆19Updated 3 months ago
LeiWang1999 / Stream-k.tvm
☆19Updated 8 months ago
rayleizhu / vllm-ra
[ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts
☆40Updated last year
LeiWang1999 / AutoGPTQ.tvm
GPTQ inference TVM kernel
☆40Updated last year
tile-ai / tvm
Open deep learning compiler stack for cpu, gpu and specialized accelerators
☆19Updated this week
dame-cell / Triformer
Transformers components but in Triton
☆34Updated last month
Dao-AILab / gemm-cublas
☆21Updated last month
flashinfer-ai / cutlass-viz
☆60Updated 2 months ago
GindaChen / FlexFlashAttention3
FlexAttention w/ FlashAttention3 Support
☆26Updated 8 months ago
PipeFusion / PipeFusion
A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters
☆47Updated 11 months ago
sustcsonglin / fla-tilelang
☆21Updated 3 months ago
CalvinXKY / mfu_calculation
A simple calculation for LLM MFU.
☆38Updated 3 months ago
Harry-Chen / InfMoE
Inference framework for MoE layers based on TensorRT with Python binding
☆41Updated 4 years ago
kyegomez / Blockwise-Parallel-Transformer
32 times longer context window than vanilla Transformers and up to 4 times longer than memory efficient Transformers.
☆48Updated 2 years ago
cassiewilliam / cuda_op_benchmark
方便扩展的Cuda算子理解和优化框架，仅用在学习使用
☆15Updated last year
habanero-lab / APPy
APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to…
☆23Updated this week
GeeeekExplorer / transformers-patch
patches for huggingface transformers to save memory
☆23Updated 3 weeks ago
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆69Updated last month
feifeibear / DPSKV3MFU
Estimate MFU for DeepSeekV3
☆24Updated 5 months ago
Bruce-Lee-LY / decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference.
☆38Updated 2 weeks ago
exists-forall / striped_attention
☆39Updated last year