BBuf / flash-rwkv
☆30 · Updated 11 months ago
Alternatives and similar repositories for flash-rwkv:
Users who are interested in flash-rwkv are comparing it to the libraries listed below.
- Triton implementation of bi-directional (non-causal) linear attention ☆47 · Updated 3 months ago
- Accelerate LLM preference tuning via prefix sharing with a single line of code ☆41 · Updated last week
- ☆22 · Updated last year
- Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024) ☆25 · Updated 10 months ago
- Continuous batching and parallel acceleration for RWKV6 ☆24 · Updated 10 months ago
- Here we will test various linear attention designs. ☆60 · Updated last year
- Reference implementation of "Softmax Attention with Constant Cost per Token" (Heinsen, 2024) ☆24 · Updated 11 months ago
- Odysseus: Playground of LLM Sequence Parallelism ☆69 · Updated 10 months ago
- Quantized Attention on GPU ☆45 · Updated 5 months ago
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆39 · Updated last year
- APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to… ☆23 · Updated 2 weeks ago
- IntLLaMA: A fast and light quantization solution for LLaMA ☆18 · Updated last year
- ☆20 · Updated 2 months ago
- Linear Attention Sequence Parallelism (LASP) ☆82 · Updated 11 months ago
- 32 times longer context window than vanilla Transformers and up to 4 times longer than memory-efficient Transformers. ☆48 · Updated last year
- Transformers components but in Triton ☆33 · Updated this week
- ☆53 · Updated 10 months ago
- ☆18 · Updated this week
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆100 · Updated this week
- [ICLR 2025] Official PyTorch implementation of "Forgetting Transformer: Softmax Attention with a Forget Gate" ☆99 · Updated last month
- Awesome Triton Resources ☆27 · Updated 2 weeks ago
- Official repository of the paper "RNNs Are Not Transformers (Yet): The Key Bottleneck on In-context Retrieval" ☆27 · Updated last year
- FlexAttention w/ FlashAttention3 Support ☆26 · Updated 7 months ago
- Using FlexAttention to compute attention with different masking patterns (see the sketch after this list) ☆43 · Updated 7 months ago
- ☆32 · Updated last year
- Contextual Position Encoding but with some custom CUDA Kernels (https://arxiv.org/abs/2405.18719) ☆22 · Updated 11 months ago
- ☆103 · Updated last year
- ☆68 · Updated this week
- Flash-Linear-Attention models beyond language ☆13 · Updated this week
- ☆48 · Updated last year
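For the FlexAttention masking repository listed above, here is a minimal sketch (not taken from that repository) of computing attention with a custom masking pattern via PyTorch's `flex_attention` API; the tensor shapes and the sliding-window width are illustrative assumptions.

```python
# Minimal sketch: attention with a custom (sliding-window causal) masking pattern
# using PyTorch's FlexAttention API (torch >= 2.5). Shapes and WINDOW are assumptions.
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 2, 4, 256, 64   # batch, heads, sequence length, head dim (illustrative)
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) for _ in range(3))

WINDOW = 64                  # hypothetical sliding-window width

def sliding_window_causal(b, h, q_idx, kv_idx):
    # Each query attends only to keys at or before its own position, within WINDOW steps.
    return (q_idx >= kv_idx) & (q_idx - kv_idx <= WINDOW)

# Build a block-sparse mask once; it can be reused for inputs with the same lengths.
block_mask = create_block_mask(sliding_window_causal, B=None, H=None, Q_LEN=S, KV_LEN=S)
out = flex_attention(q, k, v, block_mask=block_mask)  # (B, H, S, D)
```

The masking pattern is expressed as a predicate over (batch, head, query index, key index), which FlexAttention turns into a block-sparse attention kernel instead of materializing a dense mask.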