Aleph-Alpha-Research / NeurIPS-WANT-submission-efficient-parallelization-layouts
☆22 · Updated last year
Alternatives and similar repositories for NeurIPS-WANT-submission-efficient-parallelization-layouts
Users interested in NeurIPS-WANT-submission-efficient-parallelization-layouts are comparing it to the libraries listed below.
- Odysseus: Playground of LLM Sequence Parallelism ☆70 · Updated last year
- ☆31 · Updated last year
- Transformers components but in Triton ☆34 · Updated 2 months ago
- Awesome Triton Resources ☆31 · Updated 2 months ago
- GPU operators for sparse tensor operations ☆33 · Updated last year
- Contextual Position Encoding but with some custom CUDA Kernels https://arxiv.org/abs/2405.18719 ☆22 · Updated last year
- Repository of the paper "Accelerating Transformer Inference for Translation via Parallel Decoding" ☆118 · Updated last year
- ☆14 · Updated 2 years ago
- NAACL '24 (Best Demo Paper Runner-Up) / MLSys @ NeurIPS '23 - RedCoast: A Lightweight Tool to Automate Distributed Training and Inference ☆66 · Updated 7 months ago
- [ICML 2023] "Data Efficient Neural Scaling Law via Model Reusing" by Peihao Wang, Rameswar Panda, Zhangyang Wang ☆14 · Updated last year
- ☆48 · Updated last year
- ☆116 · Updated last month
- CUDA and Triton implementations of Flash Attention with SoftmaxN. ☆70 · Updated last year
- 32 times longer context window than vanilla Transformers and up to 4 times longer than memory efficient Transformers. ☆48 · Updated 2 years ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆80 · Updated 10 months ago
- Linear Attention Sequence Parallelism (LASP) ☆85 · Updated last year
- ☆20 · Updated 2 months ago
- APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to… ☆23 · Updated 3 weeks ago
- The official implementation for Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free ☆44 · Updated 2 months ago
- Vocabulary Parallelism ☆19 · Updated 4 months ago
- Continuous batching and parallel acceleration for RWKV6 ☆24 · Updated last year
- Distributed IO-aware Attention algorithm ☆20 · Updated 10 months ago
- Here we will test various linear attention designs. ☆60 · Updated last year
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆138 · Updated last month
- Accelerate LLM preference tuning via prefix sharing with a single line of code ☆42 · Updated last week
- Sequence-level 1F1B schedule for LLMs. ☆17 · Updated last year
- Simple and efficient pytorch-native transformer training and inference (batched) ☆77 · Updated last year
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) ☆107 · Updated 3 months ago
- ☆106 · Updated 10 months ago
- ☆42 · Updated last week