Aleph-Alpha / NeurIPS-WANT-submission-efficient-parallelization-layouts

☆22

Alternatives and similar repositories for NeurIPS-WANT-submission-efficient-parallelization-layouts:

Users that are interested in NeurIPS-WANT-submission-efficient-parallelization-layouts are comparing it to the libraries listed below

feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆68Updated 10 months ago
dame-cell / Triformer
Transformers components but in Triton
☆32Updated last month
exists-forall / striped_attention
☆38Updated last year
Raincleared-Song / sparse_gpu_operator
GPU operators for sparse tensor operations
☆32Updated last year
MayDomine / Seq1F1B
Sequence-level 1F1B schedule for LLMs.
☆17Updated 10 months ago
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆81Updated 5 months ago
LeiWang1999 / AutoGPTQ.tvm
GPTQ inference TVM kernel
☆38Updated last year
Doraemonzzz / Awesome-Triton-Resources
Awesome Triton Resources
☆26Updated 3 weeks ago
BBuf / flash-rwkv
☆30Updated 10 months ago
linxihui / dkernel
☆20Updated last week
sail-sg / VocabularyParallelism
Vocabulary Parallelism
☆17Updated last month
feifeibear / ChituAttention
Quantized Attention on GPU
☆45Updated 5 months ago
Harry-Chen / InfMoE
Inference framework for MoE layers based on TensorRT with Python binding
☆41Updated 3 years ago
teelinsan / parallel-decoding
Repository of the paper "Accelerating Transformer Inference for Translation via Parallel Decoding"
☆116Updated last year
INT-FlashAttention2024 / INT-FlashAttention
☆68Updated 3 months ago
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆72Updated 7 months ago
stanford-futuredata / stk
☆103Updated 7 months ago
raymin0223 / fast_robust_early_exit
Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding (EMNLP 2023 Long)
☆57Updated 6 months ago
microsoft / chunk-attention
☆69Updated last week
ModelTC / awesome-lm-system
Summary of system papers/frameworks/codes/tools on training or serving large model
☆56Updated last year
sustcsonglin / fla-tilelang
☆19Updated last month
sustcsonglin / mamba-triton
☆46Updated last year
tanyuqian / redco
NAACL '24 (Best Demo Paper RunnerUp) / MlSys @ NeurIPS '23 - RedCoast: A Lightweight Tool to Automate Distributed Training and Inference
☆64Updated 4 months ago
juvi21 / CoPE-cuda
Contextual Position Encoding but with some custom CUDA Kernels https://arxiv.org/abs/2405.18719
☆22Updated 10 months ago
CalvinXKY / mfu_calculation
A simple calculation for LLM MFU.
☆36Updated last month
pytorch-labs / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆111Updated this week
DerrickYLJ / TidalDecode
[ICLR 2025] TidalDecode: A Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
☆35Updated last week
SqueezeAILab / SqueezedAttention
SQUEEZED ATTENTION: Accelerating Long Prompt LLM Inference
☆46Updated 5 months ago
andy-yang-1 / DoubleSparse
16-fold memory access reduction with nearly no loss
☆90Updated 3 weeks ago
thu-pacman / FasterMoE
☆82Updated 3 years ago