Aleph-Alpha / NeurIPS-WANT-submission-efficient-parallelization-layouts
☆22 · Updated last year
Alternatives and similar repositories for NeurIPS-WANT-submission-efficient-parallelization-layouts:
Users interested in NeurIPS-WANT-submission-efficient-parallelization-layouts are comparing it to the repositories listed below.
- Odysseus: Playground of LLM Sequence Parallelism ☆64 · Updated 7 months ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆64 · Updated 3 months ago
- Transformers components but in Triton ☆31 · Updated 2 months ago
- ☆18 · Updated this week
- Sequence-level 1F1B schedule for LLMs. ☆17 · Updated 8 months ago
- 🔥 A minimal training framework for scaling FLA models ☆55 · Updated this week
- Boosting 4-bit inference kernels with 2:4 sparsity ☆64 · Updated 5 months ago
- Awesome Triton Resources ☆19 · Updated 2 months ago
- Squeezed Attention: Accelerating Long Prompt LLM Inference ☆40 · Updated 2 months ago
- ☆30 · Updated 8 months ago
- Vocabulary Parallelism ☆16 · Updated 3 months ago
- Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding (EMNLP 2023 Long) ☆56 · Updated 4 months ago
- TileFusion is a highly efficient kernel template library designed to elevate the level of abstraction in CUDA C for processing tiles. ☆53 · Updated this week
- Framework to reduce autotune overhead to zero for well-known deployments. ☆61 · Updated 2 weeks ago
- GPU operators for sparse tensor operations ☆30 · Updated 11 months ago
- GPTQ inference TVM kernel ☆38 · Updated 9 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆86 · Updated this week
- Linear Attention Sequence Parallelism (LASP) ☆77 · Updated 8 months ago
- ☆38 · Updated last year
- [NeurIPS 2024] Fast Best-of-N Decoding via Speculative Rejection ☆38 · Updated 3 months ago
- ☆59 · Updated last week
- 16-fold memory access reduction with nearly no loss ☆76 · Updated 3 months ago
- ☆65 · Updated last week
- ☆47 · Updated last year
- ☆81 · Updated 5 months ago
- Triton-based implementation of Sparse Mixture of Experts. ☆196 · Updated 2 months ago
- An innovative method expediting LLMs via streamlined semi-autoregressive generation and draft verification. ☆24 · Updated 11 months ago
- Summary of system papers/frameworks/codes/tools on training or serving large models ☆56 · Updated last year
- Quantized Attention on GPU ☆34 · Updated 2 months ago