BBuf / flash-rwkv
☆30 · Updated 9 months ago
Alternatives and similar repositories for flash-rwkv:
Users interested in flash-rwkv are comparing it to the libraries listed below.
- ☆22 · Updated last year
- Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024) ☆22 · Updated 9 months ago
- Triton implementation of bi-directional (non-causal) linear attention ☆43 · Updated last month
- Contextual Position Encoding with custom CUDA kernels (https://arxiv.org/abs/2405.18719) ☆22 · Updated 9 months ago
- Continuous batching and parallel acceleration for RWKV6 ☆24 · Updated 8 months ago
- 🔥 A minimal training framework for scaling FLA models ☆79 · Updated this week
- Odysseus: Playground of LLM Sequence Parallelism ☆66 · Updated 9 months ago
- Quantized Attention on GPU ☆45 · Updated 3 months ago
- DPO, but faster 🚀 ☆40 · Updated 3 months ago
- Here we will test various linear attention designs. ☆60 · Updated 10 months ago
- Linear Attention Sequence Parallelism (LASP) ☆79 · Updated 9 months ago
- APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to… ☆23 · Updated last month
- Awesome Triton Resources ☆20 · Updated 3 months ago
- Transformers components but in Triton ☆32 · Updated 4 months ago
- ☆46 · Updated last year
- IntLLaMA: A fast and light quantization solution for LLaMA ☆18 · Updated last year
- FlexAttention w/ FlashAttention3 Support ☆26 · Updated 5 months ago
- 32 times longer context window than vanilla Transformers and up to 4 times longer than memory-efficient Transformers ☆46 · Updated last year
- ☆101 · Updated last year
- An auxiliary project analyzing the characteristics of KV in DiT attention ☆27 · Updated 3 months ago
- Using FlexAttention to compute attention with different masking patterns ☆42 · Updated 5 months ago
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆39 · Updated last year
- ☆52 · Updated 8 months ago
- A large-scale RWKV v6/v7 (World, ARWKV) inference engine. Capable of inference combining multiple states (pseudo-MoE). Easy to deploy on docke… ☆31 · Updated 3 weeks ago
- A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters ☆43 · Updated 7 months ago
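Several entries above (the bi-directional linear-attention Triton kernel, LASP, the FLA training framework, and the linear-attention test bed) revolve around the same core trick: replacing softmax attention with a kernel feature map so attention can be computed in O(n·d²) rather than O(n²·d). As a rough orientation only, here is a minimal non-causal linear-attention sketch in NumPy; the `elu(x)+1` feature map and all names are illustrative and not taken from any of the listed repositories:

```python
import numpy as np

def linear_attention(q, k, v, eps=1e-6):
    """Non-causal linear attention.

    q, k: (n, d) queries and keys; v: (n, d_v) values.
    Instead of softmax(QK^T)V, apply a positive feature map phi and use
    associativity: phi(Q) @ (phi(K)^T V), which is O(n*d*d_v) not O(n^2).
    """
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1, always > 0
    q, k = phi(q), phi(k)
    kv = k.T @ v                   # (d, d_v): one global key-value summary
    z = q @ k.sum(axis=0)          # (n,): per-query normalizer
    return (q @ kv) / (z[:, None] + eps)

rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 8, 4))
out = linear_attention(q, k, v)
print(out.shape)  # (8, 4)
```

The output matches explicitly normalizing the n×n weight matrix `phi(Q) phi(K)^T`; the point of the listed libraries is to fuse and parallelize exactly this computation (causally, bi-directionally, or across devices) in Triton/CUDA rather than materializing any n×n intermediate.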