ByteDance-Seed / cudaLLM

☆125

Alternatives and similar repositories for cudaLLM

Users who are interested in cudaLLM are comparing it to the repositories listed below.

microsoft / AttentionEngine
☆116 Updated 7 months ago
flashinfer-ai / cutlass-viz
☆65 Updated 8 months ago
feifeibear / ChituAttention
Quantized Attention on GPU
☆44 Updated last year
luliyucoordinate / cute-flash-attention
Implement Flash Attention using Cute.
☆100 Updated last year
INT-FlashAttention2024 / INT-FlashAttention
☆84 Updated 11 months ago
tile-ai / AttentionEngine
☆52 Updated 7 months ago
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆74 Updated 8 months ago
xlite-dev / ffpa-attn
🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA.
☆242 Updated last month
dsl-learn / cutile-learn
NVIDIA cuTile learn
☆144 Updated 3 weeks ago
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆135 Updated 7 months ago
meta-pytorch / KernelAgent
Autonomous GPU Kernel Generation via Deep Agents
☆197 Updated last week
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆141 Updated 7 months ago
osayamenja / FlashMoE
Distributed MoE in a Single Kernel [NeurIPS '25]
☆171 Updated this week
ByteDance-Seed / FlexPrefill
Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
☆160 Updated 2 months ago
InternLM / turbomind
☆96 Updated 9 months ago
svg-project / flash-kmeans
Fast and memory-efficient exact kmeans
☆131 Updated last month
PipeFusion / PipeFusion
A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters
☆53 Updated last year
OpenBitSys / BitDecoding
[HPCA 2026] A GPU-optimized system for efficient long-context LLMs decoding with low-bit KV cache.
☆73 Updated 2 weeks ago
KuangjuX / NVSHMEM-Tutorial
NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer
☆151 Updated 3 months ago
sustcsonglin / fla-tilelang
☆35 Updated 9 months ago
IBM / triton-dejavu
Framework to reduce autotune overhead to zero for well known deployments.
☆91 Updated 3 months ago
feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆79 Updated last year
Bruce-Lee-LY / decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference.
☆46 Updated 6 months ago
PKU-SEC-Lab / HybriMoE
[DAC'25] Official implement of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference"
☆95 Updated 3 weeks ago
Dao-AILab / grouped-latent-attention
☆133 Updated 7 months ago
zhuzilin / flash-attention-with-sink
☆39 Updated 4 months ago
andy-yang-1 / DoubleSparse
16-fold memory access reduction with nearly no loss
☆109 Updated 9 months ago
IST-DASLab / qutlass
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
☆155 Updated last month
Ascend / triton-ascend
Triton adapter for Ascend. Mirror of https://gitee.com/ascend/triton-ascend
☆97 Updated this week
wangsiping97 / FastGEMV
High-speed GEMV kernels, at most 2.7x speedup compared to pytorch baseline.
☆123 Updated last year