OpenBMB / infllmv2_cuda_implLinks

☆72

Alternatives and similar repositories for infllmv2_cuda_impl

Users that are interested in infllmv2_cuda_impl are comparing it to the libraries listed below

Sorting:

ByteDance-Seed / FlexPrefill
Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
☆152Updated last month
NVlabs / COAT
[ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training
☆245Updated 3 months ago
andy-yang-1 / DoubleSparse
16-fold memory access reduction with nearly no loss
☆106Updated 7 months ago
Dao-AILab / grouped-latent-attention
☆130Updated 5 months ago
microsoft / SeerAttention
SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
☆169Updated last month
FasterDecoding / TEAL
☆147Updated 9 months ago
microsoft / AttentionEngine
☆106Updated 5 months ago
ByteDance-Seed / cudaLLM
☆120Updated 2 months ago
XunhaoLai / native-sparse-attention-triton
Efficient triton implementation of Native Sparse Attention.
☆247Updated 5 months ago
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆131Updated 11 months ago
feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆78Updated last year
li-plus / flash-preference
Accelerate LLM preference tuning via prefix sharing with a single line of code
☆51Updated 4 months ago
mit-han-lab / x-attention
[ICML 2025] XAttention: Block Sparse Attention with Antidiagonal Scoring
☆245Updated 4 months ago
thunlp / Ouroboros
Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main)
☆111Updated 7 months ago
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆127Updated 5 months ago
feifeibear / DPSKV3MFU
Estimate MFU for DeepSeekV3
☆26Updated 10 months ago
opengear-project / GEAR
GEAR: An Efficient KV Cache Compression Recipefor Near-Lossless Generative Inference of LLM
☆169Updated last year
microsoft / chunk-attention
☆81Updated 6 months ago
MayDomine / Burst-Attention
Distributed IO-aware Attention algorithm
☆21Updated last month
OpenSparseLLMs / Linear-MoE
☆120Updated 5 months ago
mit-han-lab / Block-Sparse-Attention
A sparse attention kernel supporting mix sparse patterns
☆366Updated 9 months ago
mit-han-lab / Quest
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
☆346Updated 4 months ago
yaof20 / Flash-RL
Implementation for FP8/INT8 Rollout for RL training without performence drop.
☆269Updated last week
madsys-dev / deepseekv2-profile
☆151Updated 8 months ago
mdy666 / Qwen-Native-Sparse-Attention
qwen-nsa
☆83Updated last month
FasterDecoding / SnapKV
☆287Updated 4 months ago
feifeibear / ChituAttention
Quantized Attention on GPU
☆44Updated 11 months ago
HandH1998 / QQQ
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
☆145Updated 2 months ago
smart-lty / ParallelSpeculativeDecoding
[ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length
☆128Updated 2 weeks ago
kyegomez / FlashAttention20Triton
Triton implementation of Flash Attention2.0
☆43Updated 2 years ago