fla-org / flash-bidirectional-linear-attention
Triton implementation of bidirectional (non-causal) linear attention
⭐29 · Updated last week
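For orientation, the sketch below shows in plain PyTorch the computation such a kernel fuses. The elu+1 feature map, tensor shapes, and function name are illustrative assumptions for this sketch, not the repository's actual Triton API.

```python
import torch

def bidirectional_linear_attention(q, k, v, eps=1e-6):
    """Non-causal linear attention, O(n) in sequence length.

    q, k, v: (batch, heads, seq_len, head_dim). The elu+1 feature map
    is a common choice (Katharopoulos et al., 2020) and an assumption
    here, not necessarily what this repo uses.
    """
    q = torch.nn.functional.elu(q) + 1  # positive feature map phi(q)
    k = torch.nn.functional.elu(k) + 1  # phi(k)
    # Without a causal mask, one (d x d) key-value state is shared by
    # every query position, so it is computed once with two matmuls.
    kv = torch.einsum('bhnd,bhne->bhde', k, v)        # sum_n phi(k_n) v_n^T
    z = k.sum(dim=2)                                  # normalizer sum_n phi(k_n)
    num = torch.einsum('bhnd,bhde->bhne', q, kv)
    den = torch.einsum('bhnd,bhd->bhn', q, z).unsqueeze(-1) + eps
    return num / den

q = k = v = torch.randn(2, 4, 128, 64)
out = bidirectional_linear_attention(q, k, v)  # (2, 4, 128, 64)
```

Because the key-value state is position-independent in the non-causal case, the whole computation reduces to a few large matmuls, which is what makes it a good fit for a fused Triton kernel.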
Alternatives and similar repositories for flash-bidirectional-linear-attention:
Users interested in flash-bidirectional-linear-attention are comparing it to the libraries listed below.
- 🔥 A minimal training framework for scaling FLA models ⭐12 · Updated this week
- ⭐30 · Updated 7 months ago
- HGRN2: Gated Linear RNNs with State Expansion ⭐52 · Updated 4 months ago
- Here we will test various linear attention designs. ⭐58 · Updated 8 months ago
- [ICML 2024] When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models ⭐28 · Updated 6 months ago
- Reference implementation of "Softmax Attention with Constant Cost per Token" (Heinsen, 2024) ⭐24 · Updated 7 months ago
- Official PyTorch implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule ⭐62 · Updated last week
- The open-source materials for the paper "Sparsing Law: Towards Large Language Models with Greater Activation Sparsity" ⭐18 · Updated last month
- RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best… ⭐21 · Updated 9 months ago
- APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to… ⭐21 · Updated last week
- [ICML 2024 Oral] This project is the official implementation of our Accurate LoRA-Finetuning Quantization of LLMs via Information Retention… ⭐60 · Updated 8 months ago
- ⭐48 · Updated last week
- ⭐22 · Updated last year
- ⭐25 · Updated 2 months ago
- PyTorch implementation of our paper accepted at ICML 2024: "CaM: Cache Merging for Memory-efficient LLMs Inference" ⭐29 · Updated 6 months ago
- Official code for the paper "Attention as a Hypernetwork" ⭐23 · Updated 6 months ago
- Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024). ⭐21 · Updated 6 months ago
- Squeezed Attention: Accelerating Long Prompt LLM Inference ⭐35 · Updated last month
- A repository for DenseSSMs ⭐87 · Updated 8 months ago
- ⭐99 · Updated 10 months ago
- Transformer components, but in Triton ⭐29 · Updated last month
- Beyond KV Caching: Shared Attention for Efficient LLMs ⭐13 · Updated 5 months ago
- Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for LLMs ⭐76 · Updated last month
- ⭐18 · Updated last year
- The official implementation of the paper "SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction" ⭐42 · Updated 2 months ago
- Activation-aware Singular Value Decomposition for Compressing Large Language Models ⭐55 · Updated 2 months ago
- PyTorch implementation of StableMask (ICML'24) ⭐12 · Updated 6 months ago
- The official implementation of "DAPE: Data-Adaptive Positional Encoding for Length Extrapolation" ⭐33 · Updated 3 months ago
- Code for the paper "Patch-Level Training for Large Language Models" ⭐74 · Updated last month