Aleph-Alpha / NeurIPS-WANT-submission-efficient-parallelization-layouts
☆22 · Updated last year
Alternatives and similar repositories for NeurIPS-WANT-submission-efficient-parallelization-layouts:
Users interested in NeurIPS-WANT-submission-efficient-parallelization-layouts compare it to the repositories listed below.
- Odysseus: Playground of LLM Sequence Parallelism ☆68 · Updated 10 months ago
- Transformers components but in Triton ☆32 · Updated last month
- ☆38 · Updated last year
- GPU operators for sparse tensor operations ☆32 · Updated last year
- Sequence-level 1F1B schedule for LLMs. ☆17 · Updated 10 months ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆81 · Updated 5 months ago
- GPTQ inference TVM kernel ☆38 · Updated last year
- Awesome Triton Resources ☆26 · Updated 3 weeks ago
- ☆30 · Updated 10 months ago
- ☆20 · Updated last week
- Vocabulary Parallelism ☆17 · Updated last month
- Quantized Attention on GPU ☆45 · Updated 5 months ago
- Inference framework for MoE layers based on TensorRT with Python bindings ☆41 · Updated 3 years ago
- Repository of the paper "Accelerating Transformer Inference for Translation via Parallel Decoding" ☆116 · Updated last year
- ☆68 · Updated 3 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆72 · Updated 7 months ago
- ☆103 · Updated 7 months ago
- Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding (EMNLP 2023 Long) ☆57 · Updated 6 months ago
- ☆69 · Updated last week
- Summary of system papers/frameworks/code/tools for training or serving large models ☆56 · Updated last year
- ☆19 · Updated last month
- ☆46 · Updated last year
- NAACL '24 (Best Demo Paper Runner-Up) / MLSys @ NeurIPS '23 - RedCoast: A Lightweight Tool to Automate Distributed Training and Inference ☆64 · Updated 4 months ago
- Contextual Position Encoding but with some custom CUDA kernels https://arxiv.org/abs/2405.18719 ☆22 · Updated 10 months ago
- A simple calculation for LLM MFU (a worked sketch follows this list). ☆36 · Updated last month
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆111 · Updated this week
- [ICLR 2025] TidalDecode: A Fast and Accurate LLM Decoding with Position Persistent Sparse Attention ☆35 · Updated last week
- SQUEEZED ATTENTION: Accelerating Long Prompt LLM Inference ☆46 · Updated 5 months ago
- 16-fold memory access reduction with nearly no loss ☆90 · Updated 3 weeks ago
- ☆82 · Updated 3 years ago
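For context on the MFU entry above, here is a minimal sketch of how model FLOPs utilization is commonly computed for a dense decoder-only transformer, assuming the standard ~6·N FLOPs-per-token training approximation (2·N forward, 4·N backward) and ignoring attention FLOPs. The function name and the example numbers are illustrative assumptions, not taken from the linked repository.

```python
def train_mfu(n_params: float, tokens_per_sec: float, n_gpus: int,
              peak_flops_per_gpu: float) -> float:
    """Model FLOPs utilization (MFU) for dense transformer training.

    Uses the common ~6 FLOPs per parameter per token approximation
    (2 forward + 4 backward); attention FLOPs are ignored, which is
    reasonable when sequence length is small relative to model width.
    """
    achieved_flops = 6.0 * n_params * tokens_per_sec  # FLOPs/s actually sustained
    peak_flops = n_gpus * peak_flops_per_gpu          # theoretical hardware ceiling
    return achieved_flops / peak_flops

# Illustrative numbers only: a 7B-parameter model sustaining 180k tokens/s
# on 64 GPUs rated at 312 TFLOP/s each (A100-class bf16 dense peak).
print(f"MFU = {train_mfu(7e9, 1.8e5, 64, 312e12):.1%}")  # -> MFU = 37.9%
```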