kyegomez / Blockwise-Parallel-Transformer
32 times longer context window than vanilla Transformers and up to 4 times longer than memory-efficient Transformers.
☆48 · Updated 2 years ago
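For context, the longer-context claim above rests on computing attention (and, in the Blockwise Parallel Transformer paper, the feedforward pass as well) block by block, so the full attention matrix is never materialized. Below is a minimal, unofficial PyTorch sketch of the blockwise-attention idea only; the function name, block sizes, and tensor shapes are illustrative assumptions, not this repository's API.

```python
import torch

def blockwise_attention(q, k, v, q_block=256, kv_block=256):
    """Chunked softmax attention with a running (online) softmax.

    q, k, v: (batch, seq_len, dim). Scores are only ever materialized
    for one (q_block x kv_block) tile, never for the full sequence.
    """
    scale = q.shape[-1] ** -0.5
    out_blocks = []
    for qs in range(0, q.shape[1], q_block):
        q_blk = q[:, qs:qs + q_block]                             # (B, bq, d)
        # Running statistics for a numerically stable streaming softmax.
        m = q_blk.new_full((*q_blk.shape[:2], 1), float("-inf"))  # running max
        l = torch.zeros_like(m)                                   # running denominator
        acc = torch.zeros_like(q_blk)                             # running numerator
        for ks in range(0, k.shape[1], kv_block):
            k_blk = k[:, ks:ks + kv_block]
            v_blk = v[:, ks:ks + kv_block]
            s = torch.einsum("bqd,bkd->bqk", q_blk, k_blk) * scale
            m_new = torch.maximum(m, s.amax(dim=-1, keepdim=True))
            p = torch.exp(s - m_new)
            rescale = torch.exp(m - m_new)
            l = l * rescale + p.sum(dim=-1, keepdim=True)
            acc = acc * rescale + torch.einsum("bqk,bkd->bqd", p, v_blk)
            m = m_new
        out_blocks.append(acc / l)
    return torch.cat(out_blocks, dim=1)

# Sanity check against ordinary full attention on a toy sequence.
q, k, v = (torch.randn(2, 1024, 64) for _ in range(3))
ref = torch.softmax(q @ k.transpose(-1, -2) * 64 ** -0.5, dim=-1) @ v
assert torch.allclose(blockwise_attention(q, k, v), ref, atol=1e-4)
```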
Alternatives and similar repositories for Blockwise-Parallel-Transformer
Users interested in Blockwise-Parallel-Transformer are comparing it to the libraries listed below
- Linear Attention Sequence Parallelism (LASP) · ☆85 · Updated last year
- Triton Implementation of HyperAttention Algorithm · ☆48 · Updated last year
- Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention" · ☆98 · Updated 9 months ago
- [ICML 2024] When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models · ☆31 · Updated last year
- Official repository for the paper "Approximating Two-Layer Feedforward Networks for Efficient Transformers" · ☆38 · Updated last month
- [ICML'24 Oral] The official code of "DiJiang: Efficient Large Language Models through Compact Kernelization", a novel DCT-based linear at… · ☆101 · Updated last year
- Repository for Sparse Finetuning of LLMs via modified version of the MosaicML llmfoundry · ☆42 · Updated last year
- ☆105 · Updated last year
- Here we will test various linear attention designs. · ☆60 · Updated last year
- ☆31 · Updated last year
- Sparse Backpropagation for Mixture-of-Expert Training · ☆29 · Updated last year
- DPO, but faster 🚀 · ☆43 · Updated 7 months ago
- Repository for CPU Kernel Generation for LLM Inference · ☆26 · Updated 2 years ago
- ☆55 · Updated last year
- Contextual Position Encoding but with some custom CUDA Kernels https://arxiv.org/abs/2405.18719 · ☆22 · Updated last year
- Code for "Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes" · ☆28 · Updated last year
- ☆47 · Updated last month
- Reference implementation of "Softmax Attention with Constant Cost per Token" (Heinsen, 2024) · ☆24 · Updated last year
- Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024). · ☆25 · Updated last year
- Accelerate LLM preference tuning via prefix sharing with a single line of code · ☆42 · Updated last week
- Official code for the paper "Attention as a Hypernetwork" · ☆40 · Updated last year
- Official implementation of the ICML 2024 paper RoSA (Robust Adaptation) · ☆42 · Updated last year
- ☆51 · Updated last year
- An unofficial implementation of "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" · ☆35 · Updated last year
- ☆48 · Updated last year
- ☆29 · Updated 2 years ago
- [ICML'24] The official implementation of “Rethinking Optimization and Architecture for Tiny Language Models” · ☆121 · Updated 6 months ago
- Odysseus: Playground of LLM Sequence Parallelism · ☆70 · Updated last year
- Code for "RSQ: Learning from Important Tokens Leads to Better Quantized LLMs" · ☆18 · Updated last month
- Code repository for the public reproduction of the language modelling experiments on "MatFormer: Nested Transformer for Elastic Inference… · ☆24 · Updated last year