OpenNLPLab / LASP
Linear Attention Sequence Parallelism (LASP)
☆77 · Updated 8 months ago
Alternatives and similar repositories for LASP:
Users interested in LASP are comparing it to the libraries listed below.
- Contextual Position Encoding, but with some custom CUDA kernels (https://arxiv.org/abs/2405.18719) ☆22 · Updated 8 months ago
- DPO, but faster ☆33 · Updated 2 months ago
- ☆49 · Updated 7 months ago
- 32 times longer context window than vanilla Transformers and up to 4 times longer than memory-efficient Transformers. ☆44 · Updated last year
- [ICML'24 Oral] The official code of "DiJiang: Efficient Large Language Models through Compact Kernelization", a novel DCT-based linear attention… ☆99 · Updated 8 months ago
- ☆45 · Updated last year
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆38 · Updated 11 months ago
- Here we will test various linear attention designs. ☆58 · Updated 9 months ago
- Odysseus: Playground of LLM Sequence Parallelism ☆64 · Updated 7 months ago
- 🔥 A minimal training framework for scaling FLA models ☆55 · Updated this week
- [NAACL 2025] A Closer Look into Mixture-of-Experts in Large Language Models ☆43 · Updated last week
- Easy control for Key-Value Constrained Generative LLM Inference (https://arxiv.org/abs/2402.06262) ☆60 · Updated last year
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry ☆40 · Updated last year
- ☆30 · Updated 8 months ago
- [ICML'24] The official implementation of "Rethinking Optimization and Architecture for Tiny Language Models" ☆120 · Updated last month
- Cascade Speculative Drafting ☆28 · Updated 10 months ago
- ☆82 · Updated 4 months ago
- ☆22 · Updated last year
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆64 · Updated 5 months ago
- Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for LLMs ☆77 · Updated 2 months ago
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) ☆84 · Updated 4 months ago
- Distributed IO-aware Attention algorithm ☆18 · Updated 5 months ago
- Fast and memory-efficient exact attention ☆59 · Updated this week
- ☆64 · Updated 10 months ago
- Simple and efficient pytorch-native transformer training and inference (batched) ☆68 · Updated 10 months ago
- Triton Implementation of HyperAttention Algorithm ☆46 · Updated last year
- ☆18 · Updated this week
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" ☆59 · Updated 4 months ago
- The official implementation of the paper "Demystifying the Compression of Mixture-of-Experts Through a Unified Framework". ☆57 · Updated 3 months ago