OpenNLPLab / LASP
Linear Attention Sequence Parallelism (LASP)
☆79 · Updated 9 months ago
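LASP parallelizes causal linear attention over sequence chunks: because a linear-attention layer's key-value statistics reduce to a small running state, each device can process its own chunk and only that state needs to be passed along. Below is a minimal single-process sketch of the chunkwise computation, under stated assumptions: normalization is omitted, and names such as `feature_map` and `chunk_size` are illustrative, not the repository's API. The loop over chunks stands in for the devices across which LASP would distribute the work.

```python
# Minimal sketch of chunkwise causal linear attention (single process).
# The running "state" is what a sequence-parallel scheme like LASP would
# communicate between devices; everything else is per-chunk local work.
import torch

def feature_map(x):
    # A simple positive feature map; real kernels may use other choices.
    return torch.nn.functional.elu(x) + 1

def chunked_linear_attention(q, k, v, chunk_size=128):
    """q, k, v: [seq_len, d]. Returns an (unnormalized) causal output [seq_len, d]."""
    seq_len, d = q.shape
    q, k = feature_map(q), feature_map(k)
    state = torch.zeros(d, d, dtype=q.dtype)      # running sum of k^T v over past chunks
    outputs = []
    for start in range(0, seq_len, chunk_size):
        qi = q[start:start + chunk_size]          # [c, d]
        ki = k[start:start + chunk_size]
        vi = v[start:start + chunk_size]
        inter = qi @ state                        # contribution of all previous chunks
        scores = torch.tril(qi @ ki.T)            # causal attention within the chunk
        intra = scores @ vi
        outputs.append(inter + intra)
        state = state + ki.T @ vi                 # update the state handed to the next chunk/device
    return torch.cat(outputs, dim=0)
```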
Alternatives and similar repositories for LASP:
Users interested in LASP are comparing it to the libraries listed below.
- 🔥 A minimal training framework for scaling FLA models ☆79 · Updated this week
- Fast and memory-efficient exact attention ☆66 · Updated 2 weeks ago
- Odysseus: Playground of LLM Sequence Parallelism ☆66 · Updated 9 months ago
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆39 · Updated last year
- ☆49 · Updated 4 months ago
- Simple and efficient pytorch-native transformer training and inference (batched) ☆70 · Updated 11 months ago
- DPO, but faster ☆40 · Updated 3 months ago
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry ☆40 · Updated last year
- ☆52 · Updated 8 months ago
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) ☆86 · Updated 5 months ago
- [ICML'24 Oral] The official code of "DiJiang: Efficient Large Language Models through Compact Kernelization", a novel DCT-based linear at… ☆99 · Updated 9 months ago
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆123 · Updated 3 months ago
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆109 · Updated 3 months ago
- 32 times longer context window than vanilla Transformers and up to 4 times longer than memory-efficient Transformers ☆46 · Updated last year
- Cascade Speculative Drafting ☆29 · Updated 11 months ago
- Easy control for Key-Value Constrained Generative LLM Inference (https://arxiv.org/abs/2402.06262) ☆60 · Updated last year
- [ICML'24] The official implementation of "Rethinking Optimization and Architecture for Tiny Language Models" ☆121 · Updated 2 months ago
- ☆101 · Updated last year
- ☆36 · Updated 6 months ago
- ☆30 · Updated 9 months ago
- Official repository for the paper "Approximating Two-Layer Feedforward Networks for Efficient Transformers" ☆36 · Updated last year
- Contextual Position Encoding but with some custom CUDA kernels (https://arxiv.org/abs/2405.18719) ☆22 · Updated 9 months ago
- The official implementation of the paper "SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction" ☆43 · Updated 5 months ago
- ☆139 · Updated last year
- ☆87 · Updated 5 months ago
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" ☆59 · Updated 5 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆68 · Updated 6 months ago